Re: [dynahash] do not refill the hashkey after hash_search

2023-10-24 Thread John Naylor
On Wed, Oct 25, 2023 at 12:21 PM Tom Lane  wrote:
>
> John Naylor  writes:
> > I'd prefer just adding "Assert(hentry->event == oldn);" and declaring
> > hentry PG_USED_FOR_ASSERTS_ONLY.
>
> I'm not aware of any other places where we have Asserts checking
> that hash_search() honored its contract.  Why do we need one here?

[removing old CC]
The author pointed out here that we're not consistent in this regard:

https://www.postgresql.org/message-id/CAEG8a3KEO_Kdt2Y5hFNWMEX3DpCXi9jtZOJY-GFUEE9QLgF%2Bbw%40mail.gmail.com

...but I didn't try seeing where the balance lay. We can certainly
just remove redundant assignments.




Re: Guiding principle for dropping LLVM versions?

2023-10-24 Thread Thomas Munro
Here are some systematic rules I'd like to propose to anchor this
stuff to reality and avoid future doubt and litigation:

1.  Build farm animals testing LLVM determine the set of OSes and LLVM
versions we consider.
2.  We exclude OSes that will be out of full vendor support when a
release ships.
3.  We exclude OSes that don't bless an LLVM release (eg macOS running
an arbitrarily picked version), and animals that run only to cover
ancient LLVM compiled from source (Andres's sid menagerie).

By these rules we can't require LLVM 14 for another year, because
Ubuntu and Amazon Linux are standing in the way*:

     animal     |  arch   | llvm_version |   os   | os_release | end_of_support
----------------+---------+--------------+--------+------------+----------------
 branta         | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 splitfin       | aarch64 | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 urutau         | s390x   | 10.0.0       | Ubuntu | 20.04      | 2025-04-01
 massasauga     | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30
 snakefly       | aarch64 | 11.1.0       | Amazon | 2          | 2025-06-30
 sarus          | s390x   | 14.0.0       | Ubuntu | 22.04      | 2027-06-01
 shiner         | aarch64 | 14.0.0       | Ubuntu | 22.04      | 2027-06-01
 turbot         | aarch64 | 14.0.0       | Ubuntu | 22.04      | 2027-06-01
 lora           | s390x   | 15.0.7       | RHEL   | 9          | 2027-05-31
 mamushi        | s390x   | 15.0.7       | RHEL   | 9          | 2027-05-31
 nicator        | ppc64le | 15.0.7       | Alma   | 9          | 2027-05-31
 oystercatcher  | aarch64 | 15.0.7       | Alma   | 9          | 2027-05-31

Ideally more distros would be present in this vacuum-horizon decision
table, but I don't think it'd change the conclusion: 10 is the
trailing edge.  Therefore the attached patch scales back its ambition
to that release.  Tested on LLVM 10-18.

If I pushed this we'd need to disable or upgrade the following to
avoid failure in configure on master:

   animal    |        arch        | llvm_version |   os   | os_release | end_of_support
-------------+--------------------+--------------+--------+------------+----------------
 dragonet    | x86_64             | 3.9.1        | Debian | sid        |
 phycodurus  | x86_64             | 3.9.1        | Debian | sid        |
 desmoxytes  | x86_64             | 4.0.1        | Debian | sid        |
 petalura    | x86_64             | 4.0.1        | Debian | sid        |
 mantid      | x86_64             | 5.0.1        | CentOS | 7          | 2019-08-06
 idiacanthus | x86_64             | 5.0.2        | Debian | sid        |
 pogona      | x86_64             | 5.0.2        | Debian | sid        |
 cotinga     | s390x              | 6.0.0        | Ubuntu | 18.04      | 2023-06-01
 vimba       | aarch64            | 6.0.0        | Ubuntu | 18.04      | 2023-06-01
 komodoensis | x86_64             | 6.0.1        | Debian | sid        |
 topminnow   | mips64el; -mabi=32 | 6.0.1        | Debian | 8          | 2018-06-17
 xenodermus  | x86_64             | 6.0.1        | Debian | sid        |
 alimoche    | aarch64            | 7.0.1        | Debian | 10         | 2022-09-10
 blackneck   | aarch64            | 7.0.1        | Debian | 10         | 2022-09-10
 bonito      | ppc64le            | 7.0.1        | Fedora | 29         | 2019-11-26

*Some distros announce EOL date by month without saying which day, so
in my data collecting operation I just punched in the first of the
month, *shrug*
From 581b88879316065dab113d1512aeac7e9932b0af Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 19 Oct 2023 04:45:46 +1300
Subject: [PATCH v4] jit: Require at least LLVM 10.

Remove support for older LLVM versions.  The default on common software
distributions will be at least LLVM 10 when PostgreSQL 17 ships.

Discussion: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
---
 config/llvm.m4  | 10 ++--
 configure   | 43 ++---
 doc/src/sgml/installation.sgml  |  4 +-
 meson.build |  2 +-
 src/backend/jit/llvm/llvmjit.c  | 53 ++---
 src/backend/jit/llvm/llvmjit_error.cpp  | 10 
 src/backend/jit/llvm/llvmjit_expr.c |  6 +--
 src/backend/jit/llvm/llvmjit_inline.cpp | 38 +--
 src/backend/jit/llvm/llvmjit_wrap.cpp   | 61 -
 src/include/jit/llvmjit.h   | 17 ---
 src/include/pg_config.h.in  | 12 -
 src/tools/msvc/Solution.pm  |  3 --
 12 files changed, 17 insertions(+), 242 deletions(-)

diff --git a/config/llvm.m4 b/config/llvm.m4
index 21d8cd4f90..44769d819a 100644
--- a/config/llvm.m4
+++ b/config/llvm.m4
@@ -13,7 +13,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT],
   AC_REQUIRE([AC_PROG_AWK])
 
   AC_ARG_VAR(LLVM_CONFIG, [path to llvm-config command])
-  PGAC_PATH_PROGS(LLVM_CONFIG, llvm-config llvm-config-7 llvm-config-6.0 llvm-config-5.0 

Re: [dynahash] do not refill the hashkey after hash_search

2023-10-24 Thread Tom Lane
John Naylor  writes:
> I'd prefer just adding "Assert(hentry->event == oldn);" and declaring
> hentry PG_USED_FOR_ASSERTS_ONLY.

I'm not aware of any other places where we have Asserts checking
that hash_search() honored its contract.  Why do we need one here?

regards, tom lane




Re: [dynahash] do not refill the hashkey after hash_search

2023-10-24 Thread John Naylor
On Fri, Oct 13, 2023 at 10:07 AM Nathan Bossart
 wrote:
>
> On Thu, Sep 14, 2023 at 04:28:26PM +0800, Junwang Zhao wrote:
> > Add a v2 with some change to fix warnings about unused-parameter.
> >
> > I will add this to Commit Fest.
>
> This looks reasonable to me.  I've marked the commitfest entry as
> ready-for-committer.  I will plan on committing it in a couple of days
> unless John has additional feedback or would like to do the honors.

(I've been offline for a few weeks, and have been catching up this week.)

I agree it's reasonable, but there are a couple small loose ends I'd
like to see addressed.

- strlcpy(hentry->name, name, sizeof(hentry->name));

This might do with a comment stating that we already set the value
(we've seen in this thread that some other code does this), but I don't
feel strongly about it.

  do
  {
- hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);
+ (void) hash_search(uncommitted_enums, serialized++, HASH_ENTER, NULL);
  } while (OidIsValid(*serialized));

I still consider this an unrelated and unnecessary cosmetic change.

- NotificationHash *hentry;
  bool found;

- hentry = (NotificationHash *) hash_search(pendingNotifies->hashtab,
-   &oldn,
-   HASH_ENTER,
-   &found);
+ (void) hash_search(pendingNotifies->hashtab,
+    &oldn,
+    HASH_ENTER,
+    &found);
  Assert(!found);
- hentry->event = oldn;

I'd prefer just adding "Assert(hentry->event == oldn);" and declaring
hentry PG_USED_FOR_ASSERTS_ONLY.
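
Concretely, that suggestion would look something like this (a sketch
based on the hunk above, untested):

	NotificationHash *hentry PG_USED_FOR_ASSERTS_ONLY;
	bool		found;

	hentry = (NotificationHash *) hash_search(pendingNotifies->hashtab,
											  &oldn,
											  HASH_ENTER,
											  &found);
	Assert(!found);
	/* hash_search() already copied the key, so no need to refill it */
	Assert(hentry->event == oldn);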




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-24 Thread Dilip Kumar
On Tue, Oct 24, 2023 at 9:34 PM Alvaro Herrera  wrote:
>
> On 2023-Oct-11, Dilip Kumar wrote:
>
> > In my last email, I forgot to give the link from where I have taken
> > the base path for dividing the buffer pool in banks so giving the same
> > here[1].  And looking at this again it seems that the idea of that
> > patch was from Andrey M. Borodin and the idea of the SLRU scale factor
> > were introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
> > missing that in the first email.
>
> You mean [1].
> [1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
> I don't like this idea very much, because of the magic numbers that act
> as ratios for numbers of buffers on each SLRU compared to other SLRUs.
> These values, which I took from the documentation part of the patch,
> appear to have been selected by throwing darts at the wall:
>
> NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SUBTRANS_BUFFERS        = Min(64 << slru_buffers_size_scale, shared_buffers/256)
> NUM_NOTIFY_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_SERIAL_BUFFERS          = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTOFFSET_BUFFERS = Min(32 << slru_buffers_size_scale, shared_buffers/256)
> NUM_MULTIXACTMEMBER_BUFFERS = Min(64 << slru_buffers_size_scale, shared_buffers/256)
>
> ... which look pretty random already, if similar enough to the current
> hardcoded values.  In reality, the code implements different values than
> what the documentation says.
>
> I don't see why would CLOG have the same number as COMMIT_TS, when the
> size for elements of the latter is like 32 times bigger -- however, the
> frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
> SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
> same range.  The MULTIXACT ones appear to keep the current ratio among
> them (8/16 gets changed to 32/64).
>
> ... and this whole mess is scaled exponentially without regard to the
> size that each SLRU requires.  This is just betting that enough memory
> can be wasted across all SLRUs up to the point where the one that is
> actually contended has sufficient memory.  This doesn't sound sensible
> to me.
>
> Like everybody else, I like having less GUCs to configure, but going
> this far to avoid them looks rather disastrous to me.  IMO we should
> just use Munro's older patches that gave one GUC per SLRU, and users
> only need to increase the one that shows up in pg_wait_event sampling.
> Someday we will get the (much more complicated) patches to move these
> buffers to steal memory from shared buffers, and that'll hopefully let
> us get rid of all this complexity.

Overall I agree with your comments.  Actually, I hadn't put that much
thought into the GUC part and how it scales the SLRU buffers w.r.t.
this single configurable parameter.  So I think it is better that we
take the older patch version, where we have a separate GUC per SLRU, as
our base patch.

> I'm inclined to use Borodin's patch last posted here [2] instead of your
> proposed 0001.
> [2] https://postgr.es/m/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru

I will rebase my patches on top of this.

> I did skim patches 0002 and 0003 without going into too much detail;
> they look reasonable ideas.  I have not tried to reproduce the claimed
> performance benefits.  I think measuring this patch set with the tests
> posted by Shawn Debnath in [3] is important, too.
> [3] https://postgr.es/m/yemddpmrsojfq...@f01898859afd.ant.amazon.com

Thanks for taking a look.

>
> On the other hand, here's a somewhat crazy idea.  What if, instead of
> stealing buffers from shared_buffers (which causes a lot of complexity),

Currently, we do not steal buffers from shared_buffers, although the
computation does depend on NBuffers.  I mean, for each SLRU we are
computing separate memory that is in addition to shared_buffers, no?

> we allocate a common pool for all SLRUs to use?  We provide a single
> knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
> algorithm (or something) to distribute that memory across all the SLRUs.
> So the ratio to use for this SLRU or that one would depend on the nature
> of the workload: maybe more for multixact in this server here, but more
> for subtrans in that server there; it's just the total amount that the
> user would have to configure, side by side with shared_buffers (and
> perhaps scale with it like wal_buffers), and the LRU would handle the
> rest.  The "only" problem here is finding a distribution algorithm that
> doesn't further degrade performance, of course ...

Yeah, this could be an idea, but are you saying that all the SLRUs
will share a single buffer pool, and based on the LRU algorithm it will
be decided which page will stay in the 

Re: Synchronizing slots from primary to standby

2023-10-24 Thread Amit Kapila
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand
 wrote:
>
> On 10/24/23 7:44 AM, Ajin Cherian wrote:
> > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> >  wrote:
>
> 2) When we create a subscription, another slot is created during the 
> subscription synchronization, namely
> like "pg_16397_sync_16388_7293447291374081805" (coming from 
> ReplicationSlotNameForTablesync()).
>
> This extra slot appears to have failover also set to true.
>
> So, if the standby refreshes the list of slots to sync while the subscription is
> still synchronizing, we'd see things like this on the standby:
>
> LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) 
> to pass local slot LSN (0/C0034840) and and catalog xmin (756)
> LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog 
> xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
> LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN 
> (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and 
> and catalog xmin (756)
> WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the 
> primary, aborting slot creation
>
> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have 
> failover set to true. If there is a failover
> during the subscription creation, better to re-launch the subscription 
> instead?
>

But note that the subscription doesn't wait for the completion of
tablesync. So, how will we deal with that? Also, this situation is the
same for non-tablesync slots as well. I have given another option in
the email [1] which is to enable failover even for the main slot after
all tables are in ready state, something similar to what we do for
two_phase.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1J6BqO5%3DueFAQO%2BaYyHLaU-oCHrrVFJqHS-i0Ce9aPY2w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-10-24 Thread Michael Paquier
On Wed, Oct 25, 2023 at 10:06:17AM +0530, Amul Sul wrote:
> +1 for the feature.
> 
> TWIMW, here [1] is an interesting talk from pgconf.in 2020 on a similar
> topic.
> 
> [1] https://pgconf.in/conferences/pgconfin2020/program/proposals/101

Right, this uses a shared hash table.  There is a patch from 2019 that
summarizes this presentation as well:
https://www.postgresql.org/message-id/CANXE4TdxdESX1jKw48xet-5GvBFVSq%3D4cgNeioTQff372KO45A%40mail.gmail.com

A different idea is that this patch could leverage a bgworker instead
of having a footprint in the postmaster.  FWIW, I think that my patch
is more flexible than the modes added by faultinjector.h (see 0001),
because the actions that can be taken should not be limited by the
core code: the point registered could just use what it wants as
callback, so an extension could register a custom thing as well.
--
Michael




Re: Adding facility for injection points (or probe points?) for more advanced tests

2023-10-24 Thread Amul Sul
On Wed, Oct 25, 2023 at 9:43 AM Michael Paquier  wrote:

> Hi all,
>
> I don't remember how many times in the last few years when I've had to
> hack the backend to produce a test case that involves a weird race
> condition across multiple processes running in the backend, to be able
> to prove a point or just test a fix (one recent case: 2b8e5273e949).
> Usually, I come to hardcoding stuff for the following situations:
> - Trigger a PANIC, to force recovery.
> - A FATAL, to take down a session, or just an ERROR.
> - palloc() failure injection.
> - Sleep to slow down a code path.
> - Pause and release with condition variable.


+1 for the feature.

TWIMW, here [1] is an interesting talk from pgconf.in 2020 on a similar
topic.

[1] https://pgconf.in/conferences/pgconfin2020/program/proposals/101

Regards,
Amul Sul


Adding facility for injection points (or probe points?) for more advanced tests

2023-10-24 Thread Michael Paquier
Hi all,

I don't remember how many times in the last few years when I've had to
hack the backend to produce a test case that involves a weird race
condition across multiple processes running in the backend, to be able
to prove a point or just test a fix (one recent case: 2b8e5273e949).
Usually, I come to hardcoding stuff for the following situations:
- Trigger a PANIC, to force recovery.
- A FATAL, to take down a session, or just an ERROR.
- palloc() failure injection.
- Sleep to slow down a code path.
- Pause and release with condition variable.

And, while that's helpful to prove a point on a thread, nothing comes
out of it in terms of regression test coverage in the tree because
these tests are usually too slow and expensive, as they usually rely
on hardcoded timeouts.  So that's pretty much attempting to emulate
what one would do with a debugger in a predictable way, without the
manual steps because human hands don't scale well.

The reason behind that is of course more advanced testing, to be able
to expand coverage when we have weird and non-deterministic race
issues to deal with, and the code becoming more complex every year
makes that even harder.  Fault and failure injection in specific paths
comes into mind, additionally, particularly if you manage complex
projects based on Postgres.

So, please find attached a patch set that introduces an in-core
facility to be able to set what I'm calling here an "injection point",
that consists of being able to register in shared memory a callback
that can be run within a defined location of the code.  It means that
it is not possible to trigger a callback before shared memory is set up,
but I've far more often faced the case where I wanted to trigger something
after shmem is set up anyway.  Persisting an injection point across
restarts is also possible by registering it through an extension's shmem
hook, as we do for custom LWLocks for example, as long as the library is
loaded.

This will remind a bit of what Alexander Korotkov has proposed here:
https://www.postgresql.org/message-id/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
Also, this is much closer to what Craig Ringer is mentioning here,
where it is named probe points, but I am using a minimal design that
allows achieving the same:
https://www.postgresql.org/message-id/CAPpHfdsn-hzneYNbX4qcY5rnwr-BA1ogOCZ4TQCKQAw9qa48kA%40mail.gmail.com

A difference is that I don't really see a point in passing to the
callback triggered an area of data coming from the hash table itself,
as at the end a callback could just refer to an area in shared memory
or a static set of variables depending on what it wants, with one or
more injection points (say a location to set a state, and a second to
check it).  So, at the end, the problem comes down in my opinion to
two things:
- Possibility to trigger a condition defined by some custom code, in 
the backend (core code or even out-of-core).
- Possibility to define a location in the code where a named point
would be checked.

0001 introduces three APIs to create, run, and drop injection points:
+extern void InjectionPointCreate(const char *name,
+InjectionPointCallback callback);
+extern void InjectionPointRun(const char *name);
+extern void InjectionPointDrop(const char *name);

Then one just needs to add a macro like that to trigger the callback
registered in the code to test:
INJECTION_POINT_RUN("String");
So the footprint in the core tree is not zero, but it is as minimal as
it can be.
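
To make that concrete, a hypothetical usage could look like the
following (the callback signature receiving the point name, and the
point/function names below, are assumptions for illustration; only the
three functions and the macro quoted above come from the patch):

#include "postgres.h"
#include "utils/injection_point.h"

/* callback: turn the named point into an ERROR */
static void
injection_error(const char *name)
{
	elog(ERROR, "error triggered for injection point \"%s\"", name);
}

/* extension-side setup, e.g. from a SQL-callable function */
static void
attach_error_point(void)
{
	InjectionPointCreate("my-error-point", injection_error);
}

/* ...and in the core code path under test, a no-op unless attached: */
void
function_under_test(void)
{
	INJECTION_POINT_RUN("my-error-point");
}

Dropping the point afterwards would be InjectionPointDrop("my-error-point").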

I have added some documentation to explain that, as well.  I am not
wedded to the name proposed in the patch, so if you feel there is
better, feel free to propose ideas.

This facility is hidden behind a specific configure/Meson switch,
making it a no-op by default:
--enable-injection-points
-Dinjection_points={ true | false }

0002 is a test module for these routines, which I have kept as simple as
possible to ease review of the basics proposed here.  This could be
extended further to propose more default modes with TAP tests of its
own, as I don't see a real point in having the SQL bits or some
common callbacks (like for the PANIC or FATAL cases) in core.

Thoughts and comments are welcome.
--
Michael
From 4f7a0627eb0735874b53c90576243f8003a64b4a Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 25 Oct 2023 10:24:28 +0900
Subject: [PATCH 1/2] Base facility for injection points

Extension and backend developers can register custom callbacks that
would be run on a user-defined fashion.  This is hidden behind a
./configure and a meson switch, disabled by default.
---
 src/include/pg_config.h.in|   3 +
 src/include/utils/injection_point.h   |  35 
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 .../utils/activity/wait_event_names.txt   |   1 +
 src/backend/utils/misc/Makefile   |   1 +
 src/backend/utils/misc/injection_point.c  | 158 

Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-10-24 Thread John Naylor
On Tue, Oct 17, 2023 at 9:39 PM Robert Haas  wrote:
>
> Cool. Seems we are all in agreement, so committed these. Thanks!

Thank you for getting this across the finish line!




Re: RFC: Logging plan of the running query

2023-10-24 Thread Ashutosh Bapat
On Wed, Oct 18, 2023 at 10:04 PM James Coleman  wrote:
>
> While that decision as regards auto_explain has long since been made
> (and there would be work to undo it), I don't believe that we should
> repeat that choice here. I'm -10 on moving this into auto_explain.
>

Right.

> However perhaps there is still an opportunity for moving some of the
> auto_explain code into core so as to enable deduplicating the code.
>

Right. That's what I also think.

-- 
Best Wishes,
Ashutosh Bapat




RE: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Xiang Gao
Thanks for your suggestion, this is the modified patch and two test files.

-Original Message-
From: Michael Paquier 
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao 
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM

On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote:
> This patch uses a parallel computing optimization algorithm to improve
> crc32c computing performance on ARM. The algorithm comes from the Intel
> whitepaper crc-iscsi-polynomial-crc32-instruction-paper. Input data is
> divided into three equal-sized blocks. Three parallel blocks (crc0, crc1,
> crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) * 8 (step length:
> crc32c_u64) bytes.
>
> Crc32c unittest:
> https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
> Crc32c benchmark:
> https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
> It gets ~2x speedup compared to linear Arm crc32c instructions.

Interesting.  Could you attach to this thread the test files you used and
the results obtained, please?  If this data gets deleted from github, then
it would not be possible to refer back to what you did and the related
benchmark results.

Note that your patch is forgetting about meson; it just patches ./configure.
--
Michael


0002-crc32c-parallel-computation-optimization-on-arm.patch
Description:  0002-crc32c-parallel-computation-optimization-on-arm.patch
/*
 * compile postgres first with different crc32c implementations (use arm
 * vmull_p64 or not).
 * we should comment out some code about elog in pg_crc32c_armv8_choose.c
 * to compile correctly and simply.
 * $ gcc -I ../postgres/_install/include -I ../postgres/_install/include/server \
 *       main.c -L ../postgres/build/src/port -l pgport_srv -O2 -o main
 *
 * this test was run on Neoverse-N1
 * $ ./main.no_vmull
 * data size is 512 bytes, and compute crc cost 139 us totally, 0.135742 us per loop
 * data size is 4096 bytes, and compute crc cost 1061 us totally, 1.036133 us per loop
 *
 * $ ./main.use_vmull
 * data size is 512 bytes, and compute crc cost 101 us totally, 0.098633 us per loop
 * data size is 4096 bytes, and compute crc cost 540 us totally, 0.527344 us per loop
 *
 * We can see that the cost of computing crc32c without vmull_p64 is about
 * twice the cost of using vmull_p64 when the data size is large, and the
 * cost is almost the same when the data size is small.
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

#include "c.h"
#include "port/pg_crc32c.h"

/* current time in microseconds */
uint64_t
GetTickCount()
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}

int
main()
{
#define CASE_CNT 2
	uint32_t	test_size[CASE_CNT] = {512, 1024 * 4};

	for (int case_cnt = 0; case_cnt < CASE_CNT; case_cnt++)
	{
		uint8_t    *buf = (uint8_t *) malloc(test_size[case_cnt] * sizeof(uint8_t));

		srand(0);
		for (int i = 0; i < test_size[case_cnt]; i++)
		{
			*(buf + i) = (uint8_t) (rand() % 256u);
		}

		static const uint32_t kLoop = 1024;
		uint32_t	crc = 0;
		uint64_t	start = GetTickCount();

		INIT_CRC32C(crc);
		for (int i = 0; i < kLoop; i++)
		{
			COMP_CRC32C(crc, buf, test_size[case_cnt]);
		}
		FIN_CRC32C(crc);
		uint64_t	stop = GetTickCount();

		printf("data size is %d bytes, and compute crc cost %ld us totally, %f us per loop\n",
			   test_size[case_cnt], stop - start, (double) (stop - start) / kLoop);

		free(buf);
	}
#undef CASE_CNT
	return 0;
}

/***
 * We use libcheck (https://github.com/libcheck/check) as the unit testing
 * framework.
 *
 * compile postgres first with different crc32c implementations (use arm
 * crc32c and vmull intrinsics or not). we should comment out some code
 * about elog in pg_crc32c_armv8_choose.c to compile correctly and simply.
 * $ gcc -I ../postgres/_install/include -I ../postgres/_install/include/server \
 *       crc32c_unittest.c -L ../postgres/build/src/port -l pgport_srv \
 *       -L /usr/local/lib -lcheck -o crc32c_unittest
 *
 * this test was run on Neoverse-N1
 * $ ./crc32c_unittest
 * Running suite(s): CRC32C
 * 100%: Checks: 3, Failures: 0, Errors: 0

Re: post-recovery amcheck expectations

2023-10-24 Thread Noah Misch
On Tue, Oct 24, 2023 at 07:03:34PM -0700, Peter Geoghegan wrote:
> On Mon, Oct 23, 2023 at 7:28 PM Noah Misch  wrote:
> > > That makes sense to me. I believe that it's not possible to have a
> > > string of consecutive sibling pages that are all half-dead (regardless
> > > of the BlockNumber order of sibling pages, even). But I'd probably
> > > have written the fix in roughly the same way. Although...maybe you
> > > should try to detect a string of half-dead pages? Hard to say if it's
> > > worth the trouble.
> >
> > I imagined a string of half-dead siblings could arise in structure like 
> > this:
> >
> >  *   1
> >  *  /|\
> >  * 4 <-> 2 <-> 3
> >
> > With events like this:
> >
> > - DELETE renders blk 4 deletable.
> > - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 
> > 1-4.
> > - DELETE renders blk 2 deletable.
> > - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 
> > 1-2.
> >
> > I didn't try to reproduce that, and something may well prevent it.
> 
> FWIW a couple of factors prevent it (in the absence of corruption). These are:
> 
> 1. Only VACUUM can delete pages, and in general the only possible
> source of half-dead pages is an unfortunately timed crash/error within
> VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
> page.

Agreed.

> 2. One thing that makes VACUUM back out of deleting an empty page is
> the presence of a half-dead right sibling leaf page left behind by
> some VACUUM that was interrupted at some point in the past -- see
> _bt_rightsib_halfdeadflag() for details.
> 
> Obviously, factors 1 and 2 together make three consecutive half-dead
> sibling pages impossible.

Can't it still happen if the sequence of unfortunately timed crashes causes
deletions from left to right?  Take this example, expanding the one above.
Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:

 *  1
 *   // | \\
 * 4 <-> 3 <-> 2 <-> 1

(That's not to say it has ever happened outside of a test.)




Re: Synchronizing slots from primary to standby

2023-10-24 Thread shveta malik
On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/23/23 2:56 PM, shveta malik wrote:
> > On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
> >  wrote:
>
> >> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking 
> >> if there
> >> is new synced slot(s) to be created on the standby. Do we want to keep 
> >> this behavior
> >> for V1?
> >>
> >
> > I think for the slotsync workers case, we should reduce the naptime in
> > the launcher to say 30sec and retain the default one of 3mins for
> > subscription apply workers. Thoughts?
> >
>
> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
> API on the standby that would refresh the list of sync slot at wish, thoughts?
>

Do you mean an API to refresh the list of DBIDs rather than sync-slots?
As per the current design, the launcher gets the DBID lists for all the
failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE.
These DBIDs are then distributed among the max slot-sync workers, which
fetch the slots for the concerned DBIDs at regular intervals of 10ms
(WORKER_DEFAULT_NAPTIME_MS) and create/update them locally.

thanks
Shveta




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-10-24 Thread David Rowley
On Wed, 25 Oct 2023 at 06:00, Jelte Fennema  wrote:
> Attached is a fixed version.

With foreach(), we commonly do "if (lc == NULL)" at the end of loops
as a way of checking if we did "break" to terminate the loop early.
Doing the equivalent with the new macros won't be safe as the list
element's value we broke on may be set to NULL. I think it might be a
good idea to document the fact that this wouldn't be safe with the new
macros, or better yet, document the correct way to determine if we
broke out the loop early.  I imagine someone will want to do some
conversion work at some future date and it would be good if we could
avoid introducing bugs during that process.
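
To illustrate with the long-established idiom (the list and fields here
are only illustrative; with the proposed macros the loop variable holds
the element itself, so the equivalent test is the unsafe one being
discussed):

	ListCell   *lc;

	foreach(lc, rel->indexlist)
	{
		IndexOptInfo *index = (IndexOptInfo *) lfirst(lc);

		if (index->unique)
			break;				/* found one; stop early */
	}
	if (lc == NULL)
	{
		/* reliable: lc is NULL only when the loop ran to completion */
	}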

I wonder if we should even bother setting the variable to NULL at the
end of the loop.  It feels like someone might just end up mistakenly
checking for NULLs even if we document that it's not safe.  If we left
the variable pointing to the last list element then the user of the
macro is more likely to notice their broken code. It'd also save a bit
of instruction space.

David




Re: Row pattern recognition

2023-10-24 Thread Tatsuo Ishii
> Great. I will look into this.

I am impressed by the simple NFA implementation.  It would be nicer if it
could be implemented without using recursion.

> By the way, I tested my patch (v10) to handle more large data set and
> tried to following query with pgbench database. On my laptop it works
> with 100k rows pgbench_accounts table but with beyond the number I got
   ~~~ I meant 10k.

> OOM killer. I would like to enhance this in the next patch.
> 
> SELECT aid, first_value(aid) OVER w,
> count(*) OVER w
> FROM pgbench_accounts
> WINDOW w AS (
> PARTITION BY bid
> ORDER BY aid
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> AFTER MATCH SKIP PAST LAST ROW
> INITIAL
> PATTERN (START UP+)
> DEFINE
> START AS TRUE,
> UP AS aid > PREV(aid)
> );

I ran this against your patch. It failed around > 60k rows.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Patch: Improve Boolean Predicate JSON Path Docs

2023-10-24 Thread David E. Wheeler
On Oct 23, 2023, at 20:20, Erik Wienhold  wrote:

> I thought that you may have missed that one because I saw this change
> that removes the article:
> 
>> -In the strict mode, the specified path must exactly match the structure 
>> of
>> +In strict mode, the specified path must exactly match the structure of

Oh, didn’t realize. Fixed.

> LGTM.  Would you create a commitfest entry?  I'll set the status to RfC.

Done. 

  https://commitfest.postgresql.org/45/4624/

Best,

David



v7-0001-Improve-boolean-predicate-JSON-Path-docs.patch
Description: Binary data


Re: post-recovery amcheck expectations

2023-10-24 Thread Peter Geoghegan
On Mon, Oct 23, 2023 at 7:28 PM Noah Misch  wrote:
> > That makes sense to me. I believe that it's not possible to have a
> > string of consecutive sibling pages that are all half-dead (regardless
> > of the BlockNumber order of sibling pages, even). But I'd probably
> > have written the fix in roughly the same way. Although...maybe you
> > should try to detect a string of half-dead pages? Hard to say if it's
> > worth the trouble.
>
> I imagined a string of half-dead siblings could arise in structure like this:
>
>  *   1
>  *  /|\
>  * 4 <-> 2 <-> 3
>
> With events like this:
>
> - DELETE renders blk 4 deletable.
> - Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
> - DELETE renders blk 2 deletable.
> - Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
>
> I didn't try to reproduce that, and something may well prevent it.

FWIW a couple of factors prevent it (in the absence of corruption). These are:

1. Only VACUUM can delete pages, and in general the only possible
source of half-dead pages is an unfortunately timed crash/error within
VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
page.

2. One thing that makes VACUUM back out of deleting an empty page is
the presence of a half-dead right sibling leaf page left behind by
some VACUUM that was interrupted at some point in the past -- see
_bt_rightsib_halfdeadflag() for details.

Obviously, factors 1 and 2 together make three consecutive half-dead
sibling pages impossible. I'm not quite prepared to say that even two
neighboring half-dead sibling pages are an impossibility right now,
but I think that it might well be. Possibly for reasons that are more
accidental than anything else (is the _bt_rightsib_halfdeadflag thing
a "real invariant" or just something we do because we don't want to
add additional complicated handling to nbtpage.c?), so I'll avoid
going into further detail for now.

I'm pointing this out because it argues for softening the wording
about "accept[ing] an arbitrarily-long chain of half-dead,
sibling-linked pages to the left" from your patch.

I was also wondering (mostly to myself) about the relationship (if
any) between the _bt_rightsib_halfdeadflag/_bt_leftsib_splitflag
"invariants" and the bt_child_highkey_check() check. But I don't think
that it makes sense to put that in scope -- your fix seems like a
strict improvement. This relationship is perhaps in scope here to the
limited extent that talking about strings of consecutive half-dead
pages might make it even harder to understand the design of the
bt_child_highkey_check() check. On the other hand...I'm not sure that
I understand every nuance of it myself.

> > Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
> > for good luck.
>
> Added.  That gave me the idea to check for circular links, like other parts of
> amcheck do.

Good idea.

-- 
Peter Geoghegan




Re: Synchronizing slots from primary to standby

2023-10-24 Thread shveta malik
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 10/24/23 7:44 AM, Ajin Cherian wrote:
> > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> >  wrote:
> >>
> >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> >>   SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
> >>   }
> >>
> >> +   /* set failover in the slot, as requested */
> >> +   slot->data.failover = ctx->failover;
> >> +
> >>
> >> I think we can get rid of this change in CreateDecodingContext().
> >>
> > Yes, I too noticed this in my testing, however just removing this from
> > CreateDecodingContext will not allow us to change the slot's failover flag
> > using Alter subscription.
>
> Oh right.
>
> > I am thinking of moving this change to
> > StartLogicalReplication prior to calling CreateDecodingContext by
> > parsing the command options in StartReplicationCmd
> > without adding it to the LogicalDecodingContext.
> >
>
> Yeah, that looks like a good place to update "failover".
>
> Doing more testing and I have a couple of remarks about he current behavior.
>
> 1) Let's imagine that:
>
> - there is no standby
> - standby_slot_names is set to a valid slot on the primary (but due to the 
> above, not linked to any standby)
> - then a create subscription on a subscriber WITH (failover = true) would 
> start the
> synchronisation but never finish (means leaving a "synchronisation" slot like 
> "pg_32811_sync_24576_7293415241672430356"
> in place coming from ReplicationSlotNameForTablesync()).
>
> That's expected, but maybe we should emit a warning in 
> WalSndWaitForStandbyConfirmation() on the primary when there is
> a slot part of standby_slot_names which is not active/does not have an 
> active_pid attached to it?
>

Agreed, Will do that.

> 2) When we create a subscription, another slot is created during the 
> subscription synchronization, namely
> like "pg_16397_sync_16388_7293447291374081805" (coming from 
> ReplicationSlotNameForTablesync()).
>
> This extra slot appears to have failover also set to true.
>
> So, if the standby refreshes the list of slots to sync while the subscription is
> still synchronizing, we'd see things like this on the standby:
>
> LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) 
> to pass local slot LSN (0/C0034840) and and catalog xmin (756)
> LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog 
> xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
> LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN 
> (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and 
> and catalog xmin (756)
> WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the 
> primary, aborting slot creation
>
> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have 
> failover set to true. If there is a failover
> during the subscription creation, better to re-launch the subscription 
> instead?
>

The 'failover' property of the subscription is carried over to the
tablesync slot in recent versions only with the intent that, if failover
happens during CREATE SUBSCRIPTION while large tables are still being
synced, users should be able to start from that point onward on the new
primary.  But yes, the above scenario is quite probable where no
activity is happening on the primary and thus the tablesync slot keeps
waiting for its creation during sync on the standby.  So I agree: to
simplify the stuff we can skip tablesync-slot syncing on the standby and
document this behaviour.

thanks
Shveta




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 05:15:19PM -0700, Jeff Davis wrote:
> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

I don't recall exactly why I suggested this, but if additional memcpy()s
help in some way and don't negatively impact performance, then I retract my
previous comment.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-24 Thread David G. Johnston
On Tue, Oct 24, 2023 at 1:39 PM Bruce Momjian  wrote:

> On Tue, Dec 13, 2022 at 07:38:15PM -0700, David G. Johnston wrote:
> > All,
> >
> > The recent discussion surrounding aggregates and ORDER BY moved me to
> > look over our existing documentation, especially now that we've reworked
> > the function tables, to see what improvements can be had by simply
> > documenting those functions where ORDER BY may change the user-visible
> > output.  I skipped range aggregates for the moment but handled the others
> > on the aggregates page (not window functions).  This includes the float
> > types for sum and avg.
> >
> > I added a note just before the table linking back to the syntax chapter
> > and describing the newly added rules and syntax choice in the table.
> >
> > The nuances of floating point math suggest to me that specifying order
> > by for those is in some kind of gray area and so I've marked it
> > optional...any suggestions for wording (or an xref) to explain those
> > nuances or should it just be shown non-optional like the others?  Or not
> > shown at all?
> >
> > The novelty of my examples is up for bikeshedding.  I didn't want
> > anything too long so a subquery didn't make sense, and I was trying to
> > avoid duplication as well as multiple lines - hence creating a CTE that
> > can be copied onto all of the example queries to produce the noted
> > result.
> >
> > I added a DISTINCT example to array_agg because it is the first
> > aggregate on the page and so hopefully will be seen during a cursory
> > reading.  Plus, array_agg is the go-to function for doing this kind of
> > experimentation.
>
> I like this idea, though the examples seemed too detailed so I skipped
> them.  Here is the trimmed-down patch I would like to apply.
>
I'd prefer to keep pointing out that the ones documented are those whose
outputs will vary due to ordering.

I've been sympathetic to the user comments that we don't have enough
examples.  Just using array_agg for that purpose, showing both DISTINCT and
ORDER BY seems like a fair compromise (removes two from my original
proposal).  The examples in the section we tell them to go see aren't of
that great quality.  If you strongly dislike having the function table
contain the examples we should at least improve the page we are sending
them to.  (As an aside to this, I've personally always found the syntax
block with the 5 syntaxes shown there to be intimidating/hard-to-read).

I'd at least suggest you reconsider the commentary and examples surrounding
jsonb_object_agg.

The same goes for the special knowledge of floating point behavior for why
we've chosen to document avg/sum, something that typically doesn't care
about order, as having an optional order by.

David J.


Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote:
> If you are looking at computing the CRC of records with arbitrary
> sizes, why not just generating a series with
> pg_logical_emit_message() before doing a comparison with pg_waldump or
> a custom replay loop to go through the records?  At least it would
> make the results more predictible.

I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).

Given the performance characteristics and relative simplicity of the patch,
I think this could be worth doing.  I suspect we'll want to do something
similar for x86, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: MERGE ... RETURNING

2023-10-24 Thread Merlin Moncure
On Tue, Oct 24, 2023 at 2:11 PM Jeff Davis  wrote:

> On Wed, 2023-08-23 at 11:58 +0100, Dean Rasheed wrote:
> > Updated version attached, fixing an uninitialized-variable warning
> > from the cfbot.
>
> I took another look and I'm still not comfortable with the special
> IsMergeSupportFunction() functions. I don't object necessarily -- if
> someone else wants to commit it, they can -- but I don't plan to commit
> it in this form.
>
> Can we revisit the idea of a per-WHEN RETURNING clause? The returning
> clauses could be treated kind of like a UNION, which makes sense
> because it really is a union of different results (the returned tuples
> from an INSERT are different than the returned tuples from a DELETE).
> You can just add constants to the target lists to distinguish which
> WHEN clause they came from.
>
> I know you rejected that approach early on, but perhaps it's worth
> discussing further?
>

 Yeah.  Side benefit: the 'action_number' felt really out of place, and
this might neatly solve it.  It doesn't match tg_op, for example.  With the
current approach, why not return a text, or an enum?  Why doesn't it match
concepts that are pretty well established elsewhere?  SQL has a pretty good
track record for not inventing weird numbers with no real meaning (sadly,
not so much the developers).   Having said that, pg_merge_action() doesn't
feel too bad if the syntax issues can be worked out.

Very supportive of the overall goal.

merlin


Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-24 Thread David Rowley
On Tue, 17 Oct 2023 at 20:39, David Rowley  wrote:
>
> On Mon, 16 Oct 2023 at 05:56, Tom Lane  wrote:
> > I guess the next question is whether we want to stop here or
> > try to relax the requirement about NUL-termination.  I'd be inclined
> > to call that a separate issue deserving a separate commit, so maybe
> > we should go ahead and commit this much anyway.
>
> I am keen to see this relaxed. I agree that a separate effort is best.

I looked at the latest posted patch again today with thoughts about
pushing it but there's something I'm a bit unhappy with that makes me
think we should maybe do the NUL-termination relaxation in the same
commit.

The problem is in LogicalRepApplyLoop() the current patch adjusts the
manual building of the StringInfoData to make use of
initReadOnlyStringInfo() instead. The problem I have with that is that
the string that's given to initReadOnlyStringInfo() comes from
walrcv_receive() and on looking at the API spec for walrcv_receive_fn
I see:

/*
 * walrcv_receive_fn
 *
 * Receive a message available from the WAL stream.  'buffer' is a pointer
 * to a buffer holding the message received.  Returns the length of the data,
 * 0 if no data is available yet ('wait_fd' is a socket descriptor which can
 * be waited on before a retry), and -1 if the cluster ended the COPY.
 */

i.e, no mention that the buffer will be NUL terminated upon return.

Looking at pqGetCopyData3(), is see the buffer does get NUL
terminated, but without the API spec mentioning this I'm not feeling
good about going ahead with wrapping that up in
initReadOnlyStringInfo() which Asserts the buffer will be NUL
terminated.

I've attached a patch which builds on the previous patch and relaxes
the rule that the StringInfo must be NUL-terminated.   The rule is
only relaxed for StringInfos that are initialized with
initReadOnlyStringInfo.  On working on this I went over the locations
where we've added code to add a '\0' char to the buffer.  If you look
at, for example, record_recv() and array_agg_deserialize() in master,
we modify the StringInfo's data to set a \0 at the end of the string.
I've removed that code as I *believe* this isn't required for the
type's receive function.

There's also an existing confusing comment in logicalrep_read_tuple()
which seems to think we're just setting the NUL terminator to conform
to StringInfo's practises. This is misleading as the NUL is required
for LOGICALREP_COLUMN_TEXT mode as we use the type's input function
instead of the receive function.  You don't have to look very hard to
find an input function that needs a NUL terminator.

I'm a bit less confident that the type's receive function will never
need to be NUL terminated. cstring_recv() came to mind as one I should
look at, but on looking I see it's not required as it just reads the
remaining bytes from the input StringInfo.  Is it safe to assume this,
or could there be some UDF receive function which requires this?

David
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 82f48a488e..9b37736f8e 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -774,10 +774,7 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
 			if (len == 0)
 				elog(ERROR, "invalid message length");
 
-			s.cursor = 0;
-			s.maxlen = -1;
-			s.data = (char *) data;
-			s.len = len;
+			initReadOnlyStringInfo(&s, data, len);
 
 			/*
 			 * The first byte of messages sent from leader apply worker to
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index d52c8963eb..4eaeb14148 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -879,6 +879,7 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 	/* Read the data */
 	for (i = 0; i < natts; i++)
 	{
+		char	   *buff;
 		char		kind;
 		int			len;
 		StringInfo	value = &tuple->colvalues[i];
@@ -899,19 +900,18 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 				len = pq_getmsgint(in, 4);	/* read length */
 
 				/* and data */
-				value->data = palloc(len + 1);
-				pq_copymsgbytes(in, value->data, len);
+				buff = palloc(len + 1);
+				pq_copymsgbytes(in, buff, len);
 
 				/*
-				 * Not strictly necessary for LOGICALREP_COLUMN_BINARY, but
-				 * per StringInfo practice.
+

Re: Show version of OpenSSL in ./configure output

2023-10-24 Thread Michael Paquier
On Mon, Oct 23, 2023 at 08:44:01PM -0400, Tom Lane wrote:
> OK by me.

Cool, done down to 16 as this depends on c8e4030d1bdd.
--
Michael




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-24 Thread Jeff Davis
On Sat, 2023-10-21 at 23:59 +0530, Bharath Rupireddy wrote:
> I'm attaching v12 patch set with just pgperltidy ran on the new TAP
> test added in 0002. No other changes from that of v11 patch set.

Thank you.

Comments:

* It would be good to document that this is partially an optimization
(read from memory first) and partially an API difference that allows
reading unflushed data. For instance, walsender may benefit
performance-wise (and perhaps later with the ability to read unflushed
data) whereas pg_walinspect benefits primarily from reading unflushed
data.

* Shouldn't there be a new method in XLogReaderRoutine (e.g.
read_unflushed_data), rather than having logic in WALRead()? The
callers can define the method if it makes sense (and that would be a
good place to document why); or leave it NULL if not.

* I'm not clear on the "partial hit" case. Wouldn't that mean you found
the earliest byte in the buffers, but not the latest byte requested? Is
that possible, and if so under what circumstances? I added an
"Assert(nread == 0 || nread == count)" in WALRead() after calling
XLogReadFromBuffers(), and it wasn't hit.

* If the partial hit case is important, wouldn't XLogReadFromBuffers()
fill in the end of the buffer rather than the beginning?

* Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
expectedEndPtr both before and after the memcpy(), with appropriate
barriers (a rough sketch follows after this list). That could mitigate
concerns expressed by Kyotaro Horiguchi and Masahiko Sawada.

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

* Style-wise, the use of "unlikely" seems excessive, unless there's a
reason to think it matters.
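
A rough sketch of the xlblocks double-check idea mentioned above, using
existing xlog.c names (the function name and barrier placement are
illustrative, not a reviewed patch):

static bool
read_block_no_lock(int idx, XLogRecPtr expectedEndPtr,
				   char *dst, Size skip, Size nbytes)
{
	/* Give up if the page we want is not (or no longer) in the buffers. */
	if (XLogCtl->xlblocks[idx] != expectedEndPtr)
		return false;
	pg_read_barrier();

	memcpy(dst, XLogCtl->pages + (Size) idx * XLOG_BLCKSZ + skip, nbytes);

	/* Re-check: if the page was replaced while copying, discard the copy. */
	pg_read_barrier();
	if (XLogCtl->xlblocks[idx] != expectedEndPtr)
		return false;

	return true;
}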

Regards,
Jeff Davis





Re: Row pattern recognition

2023-10-24 Thread Tatsuo Ishii
> On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii  wrote:
>> Attached is the v10 patch. This version enhances the performance of
>> pattern matching.
> 
> Nice! I've attached a couple of more stressful tests (window
> partitions of 1000 rows each). Beware that the second one runs my
> desktop out of memory fairly quickly with the v10 implementation.
> 
> I was able to carve out some time this week to implement a very basic
> recursive NFA, which handles both the + and * qualifiers (attached).

Great. I will look into this.

> It's not production quality -- a frame on the call stack for every row
> isn't going to work

Yeah.

> -- but with only those two features, it's pretty
> tiny, and it's able to run the new stress tests with no issue. If I
> continue to have time, I hope to keep updating this parallel
> implementation as you add features to the StringSet implementation,
> and we can see how it evolves. I expect that alternation and grouping
> will ratchet up the complexity.

Sounds like a plan.
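
(For readers following along: the classic shape of such a recursive
matcher, written here in a deliberately naive fully-recursive style --
not James's actual code -- makes the stack cost visible:)

#include <stdbool.h>

static bool match_here(const char *re, const char *s);

/* does pattern char c match subject char s? ('.' matches anything) */
static bool
match_one(char c, char s)
{
	return s != '\0' && (c == '.' || c == s);
}

/* match zero or more c's, then the rest of the pattern */
static bool
match_star(char c, const char *re, const char *s)
{
	if (match_here(re, s))		/* try consuming no further c's */
		return true;
	if (!match_one(c, *s))
		return false;
	return match_star(c, re, s + 1);	/* one frame per element consumed */
}

static bool
match_here(const char *re, const char *s)
{
	if (re[0] == '\0')
		return true;
	if (re[1] == '*')
		return match_star(re[0], re + 2, s);
	if (re[1] == '+')			/* one c, then zero or more */
		return match_one(re[0], *s) && match_star(re[0], re + 2, s + 1);
	return match_one(re[0], *s) && match_here(re + 1, s + 1);
}

Each element consumed by a quantifier adds a match_star() frame, which
is exactly the "frame on the call stack for every row" problem.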

By the way, I tested my patch (v10) against a larger data set and tried
the following query with the pgbench database. On my laptop it works with
a 100k-row pgbench_accounts table, but beyond that number I got the OOM
killer. I would like to enhance this in the next patch.

SELECT aid, first_value(aid) OVER w,
count(*) OVER w
FROM pgbench_accounts
WINDOW w AS (
PARTITION BY bid
ORDER BY aid
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
AFTER MATCH SKIP PAST LAST ROW
INITIAL
PATTERN (START UP+)
DEFINE
START AS TRUE,
UP AS aid > PREV(aid)
);

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


walwriter interacts quite badly with synchronous_commit=off

2023-10-24 Thread Andres Freund
Hi,

I recently mentioned to Heikki that I was seeing latch related wakeups being
frequent enough to prevent walwriter from doing a whole lot of work. He asked
me to write that set of concerns up, which seems quite fair...


Here's a profile of walwriter while the following pgbench run was ongoing:

c=1;psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE 
testtable_logged(v int not null default 0);' && PGOPTIONS='-c 
synchronous_commit=off' pgbench -n -c$c -j$c -Mprepared -T150 -f <(echo 'INSERT 
INTO testtable_logged DEFAULT VALUES;') -P1

Looking at top, walwriter is around 15-20% busy with this
workload. Unfortunately a profile quickly shows that little of that work is
useful:

perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3

-   94.42% 0.00%  postgres  postgres  [.] AuxiliaryProcessMain
 AuxiliaryProcessMain
   - WalWriterMain
  + 78.26% WaitLatch
  + 14.01% XLogBackgroundFlush
  + 0.51% pgstat_report_wal
0.29% ResetLatch
0.13% pgstat_flush_io
  + 0.02% asm_sysvec_apic_timer_interrupt
0.01% HandleWalWriterInterrupts (inlined)


Confirmed by the distribution of what syscalls are made:

perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5
   syscall            calls  errors   total      min      avg      max    stddev
                                      (msec)   (msec)   (msec)   (msec)      (%)
   --------------- -------- ------- --------- -------- -------- -------- ------
   epoll_wait        216610       0  3744.984    0.000    0.017    0.113   0.03%
   read              216602       0   333.905    0.001    0.002    0.029   0.03%
   fdatasync             27       0    94.703    1.939    3.508   11.279   8.83%
   pwrite64            2998       0    15.646    0.004    0.005    0.027   0.45%
   openat                 2       0     0.019    0.006    0.010    0.013  34.84%
   close                  2       0     0.004    0.002    0.002    0.003  25.76%

We're doing far more latch related work than actual work.

The walwriter many many times wakes up without having to do anything.

And if you increase the number of clients to e.g. c=8, it gets worse in some
ways:

perf trace:
   epoll_wait        291512       0  2364.067    0.001    0.008    0.693   0.10%
   read              290938       0   479.837    0.001    0.002    0.020   0.05%
   fdatasync            146       0   410.043    2.508    2.809    7.006   1.90%
   futex              56384   43982   183.896    0.001    0.003    2.791   1.65%
   pwrite64           17058       0   105.625    0.004    0.006    4.015   4.61%
   clock_nanosleep        1       0     1.063    1.063    1.063    1.063   0.00%
   openat                 9       0     0.072    0.006    0.008    0.014  14.35%
   close                  9       0     0.018    0.002    0.002    0.003   5.55%

Note that we have 5x more lock waits (the futex calls) than writes!


I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter
whenever it is sleeping, regardless of whether the modified asyncXactLSN will
lead to a write.  We even wake up walwriter when we haven't changed
asyncXactLSN, because our LSN is older than some other backends!

So often we'll just wake up walwriter, which finds no work, immediately goes
to sleep, just to be woken again.

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...

Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.

We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even if we
could already know we're not going to flush! Not cheap, when you do it this
many times a second.


There is an absolutely basic optimization, helping a bit at higher client
counts: Don't wake if the new asyncXactLSN is <= the old one. But it doesn't
help that much.

I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, it will wake up on its own after
wal_writer_delay.  I think we can just reuse WalWriterFlushAfter for this.

E.g. a condition like

    if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;

drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
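
Sketched out (hypothetical and untested; XLogCtl fields and GUC names as in
the current tree, everything else is illustration), the two checks could
combine like this:

    static void
    XLogSetAsyncXactLSNSketch(XLogRecPtr asyncXactLSN)
    {
        XLogRecPtr  WriteRqstPtr = asyncXactLSN;
        bool        sleeping;

        SpinLockAcquire(&XLogCtl->info_lck);

        /* Check 1: no wakeup if we don't actually advance asyncXactLSN */
        if (XLogCtl->asyncXactLSN >= asyncXactLSN)
        {
            SpinLockRelease(&XLogCtl->info_lck);
            return;
        }
        XLogCtl->asyncXactLSN = asyncXactLSN;
        LogwrtResult = XLogCtl->LogwrtResult;
        sleeping = XLogCtl->WalWriterSleeping;
        SpinLockRelease(&XLogCtl->info_lck);

        /*
         * Check 2: no wakeup until at least WalWriterFlushAfter blocks of
         * unflushed WAL have accumulated; walwriter wakes up on its own
         * after wal_writer_delay anyway, so async commits still get
         * flushed promptly.
         */
        if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
            return;

        if (sleeping)
            SetLatch(&ProcGlobal->walwriterLatch);
    }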

1 client:

before:
tps = 42926.288765 (without initial connection time)

   syscall            calls  errors    total     min     avg     max   stddev


Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Michael Paquier
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote:
> On 25/10/2023 00:18, Nathan Bossart wrote:
>> Actually, since the pg_waldump benchmark likely only involves very small
>> WAL records, it would make sense that there isn't much difference.
>> *facepalm*
> 
> No need to guess, pg_waldump -z will tell you what the record size is. And
> you can vary it by changing the checkpoint interval and/or pgbench scale
> factor: if you checkpoint frequently or if the database is larger, you get
> more full-page images which makes the records larger on average, and vice
> versa.

If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records?  At least it would
make the results more predictable.
--
Michael




Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Heikki Linnakangas

On 25/10/2023 00:18, Nathan Bossart wrote:

On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:

I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions).  There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.

However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0].  I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.


Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*


No need to guess, pg_waldump -z will tell you what the record size is. 
And you can vary it by changing the checkpoint interval and/or pgbench 
scale factor: if you checkpoint frequently or if the database is larger, 
you get more full-page images which makes the records larger on average, 
and vice versa.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 06:58:04PM +0200, Jelte Fennema wrote:
> On Tue, 24 Oct 2023 at 18:47, Nathan Bossart  wrote:
>> BTW after applying your patches, initdb began failing with the following
>> for me:
>>
>> TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", 
>> Line: 770, PID: 902807
> 
> Oh oops... That was an off by one error in the modified
> foreach_delete_current implementation.
> Attached is a fixed version.

Thanks, that fixed it for me, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
> I'm able to reproduce the speedup with the provided benchmark on an Apple
> M1 Pro (which appears to have the required instructions).  There was almost
> no change for the 512-byte case, but there was a ~60% speedup for the
> 4096-byte case.
> 
> However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
> benchmark [0].  I haven't had a chance to dig further, unfortunately.
> Assuming I'm not doing something wrong, I don't think such a result should
> necessarily disqualify this optimization, though.

Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote:
> On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote:
>> This patch uses a parallel computing optimization algorithm to
>> improve crc32c computing performance on ARM. The algorithm comes
>> from Intel whitepaper:
>> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
>> into three equal-sized blocks. Three parallel blocks (crc0, crc1,
>> crc2) for 1024 bytes. One block: 42 (BLK_LENGTH) * 8 (step length:
>> crc32c_u64) bytes.
>> 
>> Crc32c unitest: 
>> https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
>> Crc32c benchmark: 
>> https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
>> It gets ~2x speedup compared to linear Arm crc32c instructions.
> 
> Interesting.  Could you attached to this thread the test files you
> used and the results obtained please?  If this data gets deleted from
> github, then it would not be possible to refer back to what you did at
> the related benchmark results.
> 
> Note that your patch is forgetting about meson; it just patches
> ./configure.

I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions).  There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.

However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0].  I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.

[0] https://postgr.es/m/ec487192-f6aa-509a-cacb-6642dad14209%40iki.fi
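
For anyone following along, a rough sketch of the three-way interleaving the
whitepaper describes (illustrative only, not the proposed patch; the
recombination step is hidden behind a hypothetical crc32c_shift() helper,
which real implementations do with a carry-less multiply or lookup table):

    /* build with -march=armv8-a+crc; unaligned loads assumed OK */
    #include <arm_acle.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLK_LENGTH 42       /* 42 8-byte steps per block */

    /* hypothetical: advance crc as if 'len' zero bytes followed */
    extern uint32_t crc32c_shift(uint32_t crc, size_t len);

    static uint32_t
    crc32c_parallel(uint32_t crc, const unsigned char *p, size_t len)
    {
        while (len >= 3 * BLK_LENGTH * 8)
        {
            uint32_t crc0 = crc, crc1 = 0, crc2 = 0;

            /* three independent dependency chains keep the CRC unit busy */
            for (int i = 0; i < BLK_LENGTH; i++)
            {
                crc0 = __crc32cd(crc0, *(const uint64_t *) (p + 0 * BLK_LENGTH * 8));
                crc1 = __crc32cd(crc1, *(const uint64_t *) (p + 1 * BLK_LENGTH * 8));
                crc2 = __crc32cd(crc2, *(const uint64_t *) (p + 2 * BLK_LENGTH * 8));
                p += 8;
            }
            p += 2 * BLK_LENGTH * 8;

            /* recombine: earlier blocks must be "shifted" past later ones */
            crc = crc32c_shift(crc0, 2 * BLK_LENGTH * 8) ^
                  crc32c_shift(crc1, BLK_LENGTH * 8) ^ crc2;
            len -= 3 * BLK_LENGTH * 8;
        }

        /* tail: plain linear crc32c instructions */
        while (len >= 8)
        {
            crc = __crc32cd(crc, *(const uint64_t *) p);
            p += 8;
            len -= 8;
        }
        while (len-- > 0)
            crc = __crc32cb(crc, *p++);
        return crc;
    }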

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-10-24 Thread Daniel Gustafsson
> On 24 Oct 2023, at 22:34, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> I went ahead and applied this on master, thanks for review!  Now to see if
>> there will be any noticeable difference in resource usage.
> 
> I think that tools like Coverity are likely to whine about your
> use of sprintf instead of snprintf.  Sure, it's perfectly safe,
> but that won't stop the no-sprintf-ever crowd from complaining.

Fair point, that's probably quite likely to happen.  I can apply an snprintf()
conversion change like this in the two places introduced by this:

-sprintf(s, "%d", port);
+snprintf(s, sizeof(s), "%d", port);

--
Daniel Gustafsson





Re: Document aggregate functions better w.r.t. ORDER BY

2023-10-24 Thread Bruce Momjian
On Tue, Dec 13, 2022 at 07:38:15PM -0700, David G. Johnston wrote:
> All,
> 
> The recent discussion surrounding aggregates and ORDER BY moved me to look 
> over
> our existing documentation, especially now that we've reworked the function
> tables, to see what improvements can be had by simply documenting those
> functions where ORDER BY may change the user-visible output.  I skipped range
> aggregates for the moment but handled the others on the aggregates page (not
> window functions).  This includes the float types for sum and avg.
> 
> I added a note just before the table linking back to the syntax chapter and
> describing the newly added rules and syntax choice in the table.
> 
> The nuances of floating point math suggest to me that specifying order by for
> those is in some kind of gray area and so I've marked it optional...any
> suggestions for wording (or an xref) to explain those nuances or should it 
> just
> be shown non-optional like the others?  Or not shown at all?
> 
> The novelty of my examples is up for bikeshedding.  I didn't want anything too
> long so a subquery didn't make sense, and I was trying to avoid duplication as
> well as multiple lines - hence creating a CTE that can be copied onto all of
> the example queries to produce the noted result.
> 
> I added a DISTINCT example to array_agg because it is the first aggregate on
> the page and so hopefully will be seen during a cursory reading.  Plus,
> array_agg is the go-to function for doing this kind of experimentation.

I like this idea, though the examples seemed too detailed so I skipped
them.  Here is the trimmed-down patch I would like to apply.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7c3e940afe..35d8924c6b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20273,6 +20273,18 @@ SELECT NULLIF(value, '(none)') ...
aggregation.
   
 
+  
+   
+The aggregates in this section all accept an optional ORDER
+BY clause as described in the general syntax rules
+for aggregates (see ).
+input_sort_expr specifies the column names to sort
+by before being supplied to the aggregate function.  For example,
+array_agg(... ORDER BY ...) produces an array with
+ordered values.
+   
+  
+

 General-Purpose Aggregate Functions
 
@@ -20310,7 +20322,7 @@ SELECT NULLIF(value, '(none)') ...
 
  array_agg
 
-array_agg ( anynonarray )
+array_agg ( anynonarray ORDER BY input_sort_expr)
 anyarray


@@ -20321,7 +20333,7 @@ SELECT NULLIF(value, '(none)') ...
 
   

-array_agg ( anyarray )
+array_agg ( anyarray ORDER BY input_sort_expr )
 anyarray


@@ -20356,11 +20368,11 @@ SELECT NULLIF(value, '(none)') ...
 numeric


-avg ( real )
+avg ( real  ORDER BY input_sort_expr  )
 double precision


-avg ( double precision )
+avg ( double precision  ORDER BY input_sort_expr  )
 double precision


@@ -20526,14 +20538,14 @@ SELECT NULLIF(value, '(none)') ...
 
  json_agg
 
-json_agg ( anyelement )
+json_agg ( anyelement ORDER BY input_sort_expr )
 json


 
  jsonb_agg
 
-jsonb_agg ( anyelement )
+jsonb_agg ( anyelement ORDER BY input_sort_expr )
 jsonb


@@ -20573,7 +20585,8 @@ SELECT NULLIF(value, '(none)') ...
 
 json_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_expr )
 json


@@ -20582,7 +20595,8 @@ SELECT NULLIF(value, '(none)') ...
 
 jsonb_object_agg ( key
  "any", value
- "any" )
+ "any"
+ ORDER BY input_sort_expr )
 jsonb


@@ -20819,7 +20833,8 @@ SELECT NULLIF(value, '(none)') ...


 string_agg ( value
- bytea, delimiter bytea )
+ bytea, delimiter bytea
+ ORDER BY input_sort_expr )
 bytea


@@ -20851,11 +20866,11 @@ SELECT NULLIF(value, '(none)') ...
 numeric


-sum ( real )
+sum ( real  ORDER BY input_sort_expr  )
 real


-sum ( double precision )
+sum ( double precision  ORDER BY input_sort_expr  )
 double precision


@@ -20877,7 +20892,7 @@ SELECT NULLIF(value, '(none)') ...
 
  xmlagg
 
-xmlagg ( xml )
+xmlagg ( xml ORDER BY input_sort_expr )
 xml




RFC: Pluggable TOAST

2023-10-24 Thread Nikita Malakhov
Hi hackers!

We need community feedback on previously discussed topic [1].
There are some long-live issues in Postgres related to the TOAST mechanics,
like [2].
Some time ago we already proposed a set of patches with an API that allows
plugging different TOAST implementations into a live database. The patch set
introduced a lot of code and was quite crude in some places, so after several
implementations we decided to try it in a production environment for further
check-up.

The main idea behind pluggable TOAST is to make it possible to easily plug in
and use different implementations of large-value storage, preserving the
existing mechanics to keep backward compatibility, and to provide an easy,
Postgres-way means of giving users alternative mechanics for storing large
column values more effectively - we already have custom and very effective
(up to tens and even hundreds of times faster) TOAST implementations for the
bytea and JSONb data types.

As we see it, Pluggable TOAST proposes:
1) changes to the TOAST pointer itself, extending it to store custom data -
the current limitations of the TOAST pointer were discussed in [1] and [4];
2) an API which allows calls of custom TOAST implementations for certain
table columns and (a topic for discussion) certain datatypes - a hypothetical
shape is sketched below.
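
To make point 2 concrete, one hypothetical shape for such an API
(illustration only; the names are made up, not the shape of our patch set)
would be a routine table resolved per column or per datatype:

    /* hypothetical pluggable-TOAST routine table */
    typedef struct ToastRoutine
    {
        /* toast a value, returning a (possibly custom) TOAST pointer datum */
        Datum       (*toast) (Relation rel, Datum value, int options);

        /* reconstruct the full value from a TOAST pointer */
        Datum       (*detoast) (Datum toastptr);

        /* fetch a slice without reconstructing the whole value */
        Datum       (*detoast_slice) (Datum toastptr, int32 offset, int32 length);

        /* release storage when the owning row goes away */
        void        (*delete_value) (Relation rel, Datum toastptr);
    } ToastRoutine;

The default heap TOAST implementation would be registered as the fallback
routine, so existing data keeps working unchanged.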

Custom TOAST could also be used in less trivial ways - for example, limited
columnar storage could be implemented and plugged in without the heavy core
modifications that implementing Pluggable Storage (Table Access Methods)
requires, preserving the existing data and database structure while still
supporting upgrade, replication, and so on.

Any thoughts and proposals are welcome.

[1] Pluggable TOAST
https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru

[2] Infinite loop while acquiring new TOAST Oid
https://www.postgresql.org/message-id/flat/CAN-LCVPRvRzxeUdYdDCZ7UwZQs1NmZpqBUCd%3D%2BRdMPFTyt-bRQ%40mail.gmail.com

[3] JSONB Compression dictionaries
https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

[4] Extending the TOAST pointer
https://www.postgresql.org/message-id/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com
-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-10-24 Thread Tom Lane
Daniel Gustafsson  writes:
> I went ahead and applied this on master, thanks for review!  Now to see if
> there will be any noticeable difference in resource usage.

I think that tools like Coverity are likely to whine about your
use of sprintf instead of snprintf.  Sure, it's perfectly safe,
but that won't stop the no-sprintf-ever crowd from complaining.

regards, tom lane




Re: Cirrus-ci is lowering free CI cycles - what to do with cfbot, etc?

2023-10-24 Thread Daniel Gustafsson
> On 13 Sep 2023, at 01:49, Andres Freund  wrote:

> My patch increased user/sys time a bit (likely due to a higher number of
> futile psql forks), but Daniel's doesn't. And it does show a nice overall wall
> clock time saving.

I went ahead and applied this on master, thanks for review!  Now to see if
there will be any noticeable difference in resource usage.

--
Daniel Gustafsson





Re: Replace references to malloc() in libpq documentation with generic language

2023-10-24 Thread Daniel Gustafsson
> On 24 Oct 2023, at 17:07, Tom Lane  wrote:
> Daniel Gustafsson  writes:

>> I do agree with this proposed change though:
> 
>> -  all the space that will be freed by .
>> +  all the memory that will be freed by .
> 
> +1, seems harmless.

I've pushed this part, skipping the rest.

--
Daniel Gustafsson





Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2023-10-24 Thread Alexander Korotkov
On Wed, Sep 28, 2022 at 11:44 AM Aleksander Alekseev
 wrote:
> > I think, this patch was marked as "Waiting on Author", probably, by 
> > mistake. Since recent changes were done without any significant code 
> > changes and the CF bot is now happy again.
> >
> > I'm going to move it to RfC, could I? If not, please tell why.
>
> I restored the "Ready for Committer" state. I don't think it's a good
> practice to change the state every time the patch has a slight
> conflict or something. This is not helpful at all. Such things happen
> quite regularly and typically are fixed in a couple of days.

This patch seems useful to me.  I went through the thread, it seems
that all the critics are addressed.

I've rebased this patch.  Also, I've run perltidy for the tests, split
long errmsg() into errmsg(), errdetail() and errhint(), and made other
minor enhancements.

I think this patch is ready to go.  I'm going to push it if there are
no objections.

--
Regards,
Alexander Korotkov


0001-Teach-contrib-amcheck-to-check-the-unique-constr-v18.patch
Description: Binary data


Re: MERGE ... RETURNING

2023-10-24 Thread Jeff Davis
On Wed, 2023-08-23 at 11:58 +0100, Dean Rasheed wrote:
> Updated version attached, fixing an uninitialized-variable warning
> from the cfbot.

I took another look and I'm still not comfortable with the special
IsMergeSupportFunction() functions. I don't object necessarily -- if
someone else wants to commit it, they can -- but I don't plan to commit
it in this form.

Can we revisit the idea of a per-WHEN RETURNING clause? The returning
clauses could be treated kind of like a UNION, which makes sense
because it really is a union of different results (the returned tuples
from an INSERT are different than the returned tuples from a DELETE).
You can just add constants to the target lists to distinguish which
WHEN clause they came from.

I know you rejected that approach early on, but perhaps it's worth
discussing further?

Regards,
Jeff Davis





Re: Row pattern recognition

2023-10-24 Thread Jacob Champion
On Sat, Oct 21, 2023 at 7:39 PM Tatsuo Ishii  wrote:
> Attached is the v10 patch. This version enhances the performance of
> pattern matching.

Nice! I've attached a couple of more stressful tests (window
partitions of 1000 rows each). Beware that the second one runs my
desktop out of memory fairly quickly with the v10 implementation.

I was able to carve out some time this week to implement a very basic
recursive NFA, which handles both the + and * qualifiers (attached).
It's not production quality -- a frame on the call stack for every row
isn't going to work -- but with only those two features, it's pretty
tiny, and it's able to run the new stress tests with no issue. If I
continue to have time, I hope to keep updating this parallel
implementation as you add features to the StringSet implementation,
and we can see how it evolves. I expect that alternation and grouping
will ratchet up the complexity.
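
For concreteness, the core of such a backtracker might be sketched as follows
(toy code, not the attached patch; row_satisfies() stands in for evaluating a
DEFINE clause, and each consumed row costs a stack frame, as noted above):

    typedef struct PatternItem
    {
        int         var;            /* index of the DEFINE variable */
        char        quantifier;     /* '1' (exactly one), '+' or '*' */
    } PatternItem;

    /* assumed callback: does 'row' satisfy the DEFINE clause of 'var'? */
    extern bool row_satisfies(int64 row, int var);

    /*
     * Greedy backtracking match of p[pi..np-1] starting at 'row'; returns
     * the number of rows consumed, or -1 if there is no match.
     */
    static int64
    match_items(const PatternItem *p, int np, int pi, int64 row, int64 nrows)
    {
        int64       sub;

        if (pi == np)
            return 0;               /* whole pattern matched */

        if (p[pi].quantifier != '*')
        {
            /* '1' and '+' both require one satisfying row first */
            if (row >= nrows || !row_satisfies(row, p[pi].var))
                return -1;

            if (p[pi].quantifier == '+')
            {
                /* greedily try to stay on this item... */
                sub = match_items(p, np, pi, row + 1, nrows);
                if (sub >= 0)
                    return sub + 1;
            }
            /* ...else move on to the next item */
            sub = match_items(p, np, pi + 1, row + 1, nrows);
            return (sub < 0) ? -1 : sub + 1;
        }

        /* '*': greedily consume a satisfying row, else skip the item */
        if (row < nrows && row_satisfies(row, p[pi].var))
        {
            sub = match_items(p, np, pi, row + 1, nrows);
            if (sub >= 0)
                return sub + 1;
        }
        return match_items(p, np, pi + 1, row, nrows);
    }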

Thanks!
--Jacob
From fb3cbb6f99f0fe7b05027759454d7e0013225929 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Fri, 20 Oct 2023 16:11:14 -0700
Subject: [PATCH 2/2] squash! Row pattern recognition patch (executor).

- Extract pattern matching logic into match_pattern().
- Replace the str_set/initials implementation with a recursive
  backtracker.
- Rework match_pattern in preparation for * quantifier: separate success
  from number of matched rows.
- Handle `*` quantifier
- Simplify the evaluate_pattern() signature.
- Remove defineInitial and related code.
---
 src/backend/executor/nodeWindowAgg.c| 871 +++-
 src/backend/optimizer/plan/createplan.c |   4 -
 src/backend/parser/parse_clause.c   |  31 -
 src/include/nodes/execnodes.h   |   1 -
 src/include/nodes/parsenodes.h  |   2 -
 src/include/nodes/plannodes.h   |   3 -
 6 files changed, 91 insertions(+), 821 deletions(-)

diff --git a/src/backend/executor/nodeWindowAgg.c 
b/src/backend/executor/nodeWindowAgg.c
index 2e1baef7ea..30abe60159 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -161,40 +161,6 @@ typedef struct WindowStatePerAggData
boolrestart;/* need to restart this agg in 
this cycle? */
 } WindowStatePerAggData;
 
-/*
- * Set of StringInfo. Used in RPR.
- */
-typedef struct StringSet {
-   StringInfo  *str_set;
-   Sizeset_size;   /* current array allocation size in 
number of items */
-   int set_index;  /* current used size */
-} StringSet;
-
-/*
- * Allowed subsequent PATTERN variables positions.
- * Used in RPR.
- *
- * pos represents the pattern variable defined order in DEFINE caluase.  For
- * example. "DEFINE START..., UP..., DOWN ..." and "PATTERN START UP DOWN UP"
- * will create:
- * VariablePos[0].pos[0] = 0;  START
- * VariablePos[1].pos[0] = 1;  UP
- * VariablePos[1].pos[1] = 3;  UP
- * VariablePos[2].pos[0] = 2;  DOWN
- *
- * Note that UP has two pos because UP appears in PATTERN twice.
- *
- * By using this strucrture, we can know which pattern variable can be followed
- * by which pattern variable(s). For example, START can be followed by UP and
- * DOWN since START's pos is 0, and UP's pos is 1 or 3, DOWN's pos is 2.
- * DOWN can be followed by UP since UP's pos is either 1 or 3. 
- * 
- */
-#define NUM_ALPHABETS  26  /* we allow [a-z] variable initials */
-typedef struct VariablePos {
-   int pos[NUM_ALPHABETS]; /* postion(s) in 
PATTERN */
-} VariablePos;
-
 static void initialize_windowaggregate(WindowAggState *winstate,
   
WindowStatePerFunc perfuncstate,
   
WindowStatePerAgg peraggstate);
@@ -249,27 +215,11 @@ static void register_reduced_frame_map(WindowAggState 
*winstate, int64 pos, int
 static void clear_reduced_frame_map(WindowAggState *winstate);
 static void update_reduced_frame(WindowObject winobj, int64 pos);
 
-static int64 evaluate_pattern(WindowObject winobj, int64 current_pos,
- char *vname, 
StringInfo encoded_str, bool *result);
+static bool evaluate_pattern(WindowObject winobj, int64 current_pos,
+char *vname);
 
 static bool get_slots(WindowObject winobj, int64 current_pos);
 
-//static int   search_str_set(char *pattern, StringSet *str_set, int 
*initial_orders);
-static int search_str_set(char *pattern, StringSet *str_set, VariablePos 
*variable_pos);
-static charpattern_initial(WindowAggState *winstate, char *vname);
-static int do_pattern_match(char *pattern, char *encoded_str);
-
-static StringSet *string_set_init(void);
-static void string_set_add(StringSet *string_set, StringInfo str);
-static StringInfo string_set_get(StringSet *string_set, int index);
-static int 

Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Stephen Frost
Greetings,

On Tue, Oct 24, 2023 at 14:42 Robert Haas  wrote:

> On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis  wrote:
> > Perhaps the idea is that if there are constraints involved, the failure
> > or success of an INSERT/UPDATE/DELETE could leak information that you
> > don't have privileges to read.
>
> My recollection of this topic is pretty hazy, but like Tom, I seem to
> remember it being intentional, and I think the reason had something to
> do with wanting the slice of an RLS-protected table that you can see to
> feel like a complete table. When you update a row in a table all of
> which is visible to you, the updated row can never vanish as a result
> of that update, so it was thought, if I remember correctly, that this
> should also be true here. It's also similar to what happens if an
> updatable view has WITH CHECK OPTION, and I think that was part of the
> precedent as well. I don't know whether or not the constraint issue
> that you mention here was also part of the concern, but it may have
> been. This was all quite a while ago...


Yes, having it be similar to a view WITH CHECK OPTION was intentional, also
on not wishing for things to be able to disappear or to not get saved. The
risk of a constraint possibly causing the leak of information is better
than either having data just thrown away or having the constraint not
provide the guarantee it’s supposed to …

Thanks,

Stephen

(On my phone at an event currently, sorry for not digging in deeper on
this..)



Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Robert Haas
On Tue, Oct 24, 2023 at 1:46 PM Jeff Davis  wrote:
> Perhaps the idea is that if there are constraints involved, the failure
> or success of an INSERT/UPDATE/DELETE could leak information that you
> don't have privileges to read.

My recollection of this topic is pretty hazy, but like Tom, I seem to
remember it being intentional, and I think the reason had something to
do with wanting the slice of an RLS-protected table that you can see to
feel like a complete table. When you update a row in a table all of
which is visible to you, the updated row can never vanish as a result
of that update, so it was thought, if I remember correctly, that this
should also be true here. It's also similar to what happens if an
updatable view has WITH CHECK OPTION, and I think that was part of the
precedent as well. I don't know whether or not the constraint issue
that you mention here was also part of the concern, but it may have
been. This was all quite a while ago...

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Jeff Davis
On Tue, 2023-10-24 at 11:59 -0400, Tom Lane wrote:
> I'm fairly sure that it was intentional, but I don't recall the
> reasoning; perhaps Stephen does.  In any case, I grasp your point
> that maybe we should distinguish RETURNING from not-RETURNING cases.

Perhaps the idea is that if there are constraints involved, the failure
or success of an INSERT/UPDATE/DELETE could leak information that you
don't have privileges to read.

Regards,
Jeff Davis





Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 06:04:13PM +0200, Alvaro Herrera wrote:
> Like everybody else, I like having less GUCs to configure, but going
> this far to avoid them looks rather disastrous to me.  IMO we should
> just use Munro's older patches that gave one GUC per SLRU, and users
> only need to increase the one that shows up in pg_wait_event sampling.
> Someday we will get the (much more complicated) patches to move these
> buffers to steal memory from shared buffers, and that'll hopefully let
> us get rid of all this complexity.

+1

> On the other hand, here's a somewhat crazy idea.  What if, instead of
> stealing buffers from shared_buffers (which causes a lot of complexity),
> we allocate a common pool for all SLRUs to use?  We provide a single
> knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
> algorithm (or something) to distribute that memory across all the SLRUs.
> So the ratio to use for this SLRU or that one would depend on the nature
> of the workload: maybe more for multixact in this server here, but more
> for subtrans in that server there; it's just the total amount that the
> user would have to configure, side by side with shared_buffers (and
> perhaps scale with it like wal_buffers), and the LRU would handle the
> rest.  The "only" problem here is finding a distribution algorithm that
> doesn't further degrade performance, of course ...

I think it's worth a try.  It does seem simpler, and it might allow us to
sidestep some concerns about scaling when the SLRU pages are in
shared_buffers [0].

[0] https://postgr.es/m/ZPsaEGRvllitxB3v%40tamriel.snowman.net

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-10-24 Thread Jelte Fennema
On Tue, 24 Oct 2023 at 18:47, Nathan Bossart  wrote:
> BTW after applying your patches, initdb began failing with the following
> for me:
>
> TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", 
> Line: 770, PID: 902807

Oh oops... That was an off by one error in the modified
foreach_delete_current implementation.
Attached is a fixed version.


v2-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch
Description: Binary data


v2-0001-Add-macros-for-looping-through-a-list-without-nee.patch
Description: Binary data


Re: LLVM 16 (opaque pointers)

2023-10-24 Thread Mark Wong
On Mon, Oct 23, 2023 at 10:47:24PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > FC 29 is well out of support, so I don't think it makes sense to invest any
> > further time in this. Personally, I don't think it's useful to have years 
> > old
> > fedora in the buildfarm...
> 
> +1.  It's good to test old LTS distros, but Fedora releases have a
> short shelf life by design.

I'll start retiring those old Fedora ones I have. :)

Regards,
Mark




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 06:03:48PM +0200, Jelte Fennema wrote:
> Many usages of the foreach macro in the Postgres codebase only use the
> ListCell variable to then get its value. This adds macros that
> simplify iteration code for that very common use case. Instead of
> passing a ListCell you can pass a variable of the type of its
> contents. This IMHO improves readability of the code by reducing the
> total amount of code while also essentially forcing the use of useful
> variable names.
> 
> While this might seem like a small quality of life improvement, in
> practice it turns out to be very nice to use. At Microsoft we have
> been using macros very similar to these ones in the Citus codebase for
> a long time now and we pretty much never use plain foreach anymore for
> new code.

This seems reasonable to me.

> Finally, I guess there needs to be some bikeshedding on the naming. In
> the Citus codebase we call them foreach_xyz instead of the
> for_each_xyz naming pattern that is used in this patchset. I'm not
> sure what the current stance is on if foreach should be written with
> or without an underscore between for and each. Currently pg_list.h
> uses both.

I don't have a strong opinion on the matter, but if I had to choose, I
guess I'd pick foreach_*() because these macros are most closely related to
foreach().

BTW after applying your patches, initdb began failing with the following
for me:

TRAP: failed Assert("n >= 0 && n < list->length"), File: "list.c", 
Line: 770, PID: 902807

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding a pg_get_owned_sequence function?

2023-10-24 Thread Nathan Bossart
On Tue, Sep 12, 2023 at 03:53:28PM +0100, Dagfinn Ilmari Mannsåker wrote:
> Tom Lane  writes:
>> It's possible that we could get away with just summarily changing
>> the argument type from text to regclass.  ISTR that we did exactly
>> that with nextval() years ago, and didn't get too much push-back.
>> But we couldn't do the same for the return type.  Also, this
>> approach does nothing for the concern about the name being
>> misleading.
> 
> Maybe we should go all the way the other way, and call it
> pg_get_identity_sequence() and claim that "serial" is a legacy form of
> identity columns?

Hm.  Could we split it into two functions, pg_get_owned_sequence() and
pg_get_identity_sequence()?  I see that commit 3012061 [0] added support
for identity columns to pg_get_serial_sequence() because folks expected
that to work, so maybe that's a good reason to keep them together.  If we
do elect to keep them combined, I'd be okay with renaming it to
pg_get_identity_sequence() along with your other proposed changes.

[0] https://postgr.es/m/20170912212054.25640.55202%40wrigleys.postgresql.org

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: trying again to get incremental backup

2023-10-24 Thread Robert Haas
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut  wrote:
> The easiest answer is to have it off by default.  Let people figure out
> what works for them.  There are various factors like storage, network,
> server performance, RTO that will determine what combination of full
> backup, incremental backup, and WAL replay will satisfy someone's
> requirements.  I suppose tests could be set up to determine this to some
> degree.  But we could also start slow and let people figure it out
> themselves.  When pg_basebackup was added, it was also disabled by default.
>
> If we think that 7d is a good setting, then I would suggest to consider,
> like 10d.  Otherwise, if you do a weekly incremental backup and you have
> a time change or a hiccup of some kind one day, you lose your backup
> sequence.
>
> Another possible answer is, like, 400 days?  Because why not?  What is a
> reasonable upper limit for this?

In concept, I don't think this should even be time-based. What you
should do is remove WAL summaries once you know that you've taken as
many incremental backups that might use them as you're ever going to
do. But PostgreSQL itself doesn't have any way of knowing what your
intended backup patterns are. If your incremental backup fails on
Monday night and you run it manually on Tuesday morning, you might
still rerun it as an incremental backup. If it fails every night for a
month and you finally realize and decide to intervene manually, maybe
you want a new full backup at that point. It's been a month. But on
the other hand maybe you don't. There's no time-based answer to this
question that is really correct, and I think it's quite possible that
your backup software might want to shut off time-based deletion
altogether and make its own decisions about when to nuke summaries.
However, I also don't think that's a great default setting. It could
easily lead to people wasting a bunch of disk space for no reason.

As far as the 7d value, I figured that nightly incremental backups
would be fairly common. If we think weekly incremental backups would
be common, then pushing this out to 10d would make sense. While
there's no reason you couldn't take an annual incremental backup, and
thus want a 400d value, it seems like a pretty niche use case.

Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.

> It looks like each file entry in the manifest takes about 150 bytes, so
> 1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).

> The current user experience of pg_basebackup is that it waits possibly a
> long time for a checkpoint, and there is an option to make it go faster,
> but there is no timeout AFAICT.  Is this substantially different?  Could
> we just let it wait forever?

We could. I installed the timeout because the first versions of the
feature were buggy, and I didn't like having my tests hang forever
with no indication of what had gone wrong. At least in my experience
so far, the time spent waiting for WAL summarization is typically
quite short, because only the WAL that needs to be summarized is
whatever was emitted since the last time it woke up up through the
start LSN of the backup. That's probably not much, and the next time
the summarizer wakes up, the file should appear within moments. So,
it's a little different from the checkpoint case, where long waits are
expected.

> Also, does waiting for checkpoint and WAL summarization happen in
> parallel?  If so, what if it starts a checkpoint that might take 15 min
> to complete, and then after 60 seconds it kicks you off because the WAL
> summarization isn't ready.  That might be wasteful.

It is not parallel. The trouble is, we don't really have any way to
know whether WAL summarization is going to fail for whatever reason.
We don't expect that to happen, but if somebody changes the
permissions on the WAL summary directory or attaches gdb to the WAL
summarizer process or something of that sort, it might.

We could check at the outset whether we seem to be really far behind
and emit a warning. For instance, if we're 1TB behind on WAL
summarization when the checkpoint is requested, chances are something
is busted and we're probably not going to catch up any time soon. We
could warn the user about that and let them make their own decision
about whether to cancel. But, that idea won't help in unattended
operation, and the threshold for "really far behind" is not very
clear. It might be better to wait until we get more experience with
how things actually fail before doing too much engineering here, but
I'm also open to 

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-24 Thread Alvaro Herrera
On 2023-Oct-11, Dilip Kumar wrote:

> In my last email, I forgot to give the link from where I have taken
> the base path for dividing the buffer pool in banks so giving the same
> here[1].  And looking at this again it seems that the idea of that
> patch was from Andrey M. Borodin and the idea of the SLRU scale factor
> were introduced by Yura Sokolov and Ivan Lazarev.  Apologies for
> missing that in the first email.

You mean [1].
[1] https://postgr.es/m/452d01f7e331458f56ad79bef537c31b%40postgrespro.ru
I don't like this idea very much, because of the magic numbers that act
as ratios for numbers of buffers on each SLRU compared to other SLRUs.
These values, which I took from the documentation part of the patch,
appear to have been selected by throwing darts at the wall:

NUM_CLOG_BUFFERS            = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_COMMIT_TS_BUFFERS       = Min(128 << slru_buffers_size_scale, shared_buffers/256)
NUM_SUBTRANS_BUFFERS        = Min(64  << slru_buffers_size_scale, shared_buffers/256)
NUM_NOTIFY_BUFFERS          = Min(32  << slru_buffers_size_scale, shared_buffers/256)
NUM_SERIAL_BUFFERS          = Min(32  << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTOFFSET_BUFFERS = Min(32  << slru_buffers_size_scale, shared_buffers/256)
NUM_MULTIXACTMEMBER_BUFFERS = Min(64  << slru_buffers_size_scale, shared_buffers/256)

... which look pretty random already, if similar enough to the current
hardcoded values.  In reality, the code implements different values than
what the documentation says.

I don't see why would CLOG have the same number as COMMIT_TS, when the
size for elements of the latter is like 32 times bigger -- however, the
frequency of reads for COMMIT_TS is like 1000x smaller than for CLOG.
SUBTRANS is half of CLOG, yet it is 16 times larger, and it covers the
same range.  The MULTIXACT ones appear to keep the current ratio among
them (8/16 gets changed to 32/64).

... and this whole mess is scaled exponentially without regard to the
size that each SLRU requires.  This is just betting that enough memory
can be wasted across all SLRUs up to the point where the one that is
actually contended has sufficient memory.  This doesn't sound sensible
to me.

Like everybody else, I like having less GUCs to configure, but going
this far to avoid them looks rather disastrous to me.  IMO we should
just use Munro's older patches that gave one GUC per SLRU, and users
only need to increase the one that shows up in pg_wait_event sampling.
Someday we will get the (much more complicated) patches to move these
buffers to steal memory from shared buffers, and that'll hopefully let
us get rid of all this complexity.


I'm inclined to use Borodin's patch last posted here [2] instead of your
proposed 0001.
[2] https://postgr.es/m/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru

I did skim patches 0002 and 0003 without going into too much detail;
they look reasonable ideas.  I have not tried to reproduce the claimed
performance benefits.  I think measuring this patch set with the tests
posted by Shawn Debnath in [3] is important, too.
[3] https://postgr.es/m/yemddpmrsojfq...@f01898859afd.ant.amazon.com


On the other hand, here's a somewhat crazy idea.  What if, instead of
stealing buffers from shared_buffers (which causes a lot of complexity),
we allocate a common pool for all SLRUs to use?  We provide a single
knob -- say, non_relational_buffers=32MB as default -- and we use a LRU
algorithm (or something) to distribute that memory across all the SLRUs.
So the ratio to use for this SLRU or that one would depend on the nature
of the workload: maybe more for multixact in this server here, but more
for subtrans in that server there; it's just the total amount that the
user would have to configure, side by side with shared_buffers (and
perhaps scale with it like wal_buffers), and the LRU would handle the
rest.  The "only" problem here is finding a distribution algorithm that
doesn't further degrade performance, of course ...
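
Just to illustrate the shape of that idea (toy sketch only; a real version
needs a mapping hash table, partitioned locking and dirty-page writeback,
which is exactly where the distribution-algorithm problem lives):

    typedef struct SharedSlruSlot
    {
        int         slru_id;        /* owning SLRU, or -1 if free */
        int64       pageno;         /* page number within that SLRU */
        uint64      last_used;      /* global recency stamp */
        char       *page;           /* BLCKSZ bytes */
    } SharedSlruSlot;

    static SharedSlruSlot *pool;    /* non_relational_buffers / BLCKSZ slots */
    static int  pool_size;
    static uint64 lru_clock;        /* bumped on every access */

    /* find the globally least-recently-used slot to evict */
    static int
    pool_victim(void)
    {
        int         best = 0;

        for (int i = 1; i < pool_size; i++)
            if (pool[i].last_used < pool[best].last_used)
                best = i;
        return best;
    }

    /* look up a page; NULL means the caller reads it into pool_victim() */
    static char *
    pool_lookup(int slru_id, int64 pageno)
    {
        for (int i = 0; i < pool_size; i++)
        {
            if (pool[i].slru_id == slru_id && pool[i].pageno == pageno)
            {
                pool[i].last_used = ++lru_clock;
                return pool[i].page;
            }
        }
        return NULL;
    }

The per-SLRU share would then emerge from the access pattern rather than from
seven hand-tuned ratios.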

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
  -- Paul Graham, http://www.paulgraham.com/opensource.html




Add new for_each macros for iterating over a List that do not require ListCell pointer

2023-10-24 Thread Jelte Fennema
Many usages of the foreach macro in the Postgres codebase only use the
ListCell variable to then get its value. This adds macros that
simplify iteration code for that very common use case. Instead of
passing a ListCell you can pass a variable of the type of its
contents. This IMHO improves readability of the code by reducing the
total amount of code while also essentially forcing the use of useful
variable names.
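
A simplified sketch of the idea (illustrative; not exactly the attached
patch):

    #include "postgres.h"
    #include "nodes/pg_list.h"

    /* hide the ListCell behind a generated name; caller declares only 'var' */
    #define foreach_ptr(var, lst) \
        for (ListCell *var##_cell = list_head(lst); \
             var##_cell != NULL && ((var) = lfirst(var##_cell), true); \
             var##_cell = lnext(lst, var##_cell))

    static void
    log_relations(List *tables)     /* assumes a List of RangeVar pointers */
    {
        RangeVar   *rv;

        foreach_ptr(rv, tables)
            elog(LOG, "table: %s", rv->relname);
    }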

While this might seem like a small quality of life improvement, in
practice it turns out to be very nice to use. At Microsoft we have
been using macros very similar to these ones in the Citus codebase for
a long time now and we pretty much never use plain foreach anymore for
new code.

Finally, I guess there needs to be some bikeshedding on the naming. In
the Citus codebase we call them foreach_xyz instead of the
for_each_xyz naming pattern that is used in this patchset. I'm not
sure what the current stance is on if foreach should be written with
or without an underscore between for and each. Currently pg_list.h
uses both.

P.S. Similar macros for forboth/forthree are also possible, but
require an exponential macro count handle all different possibilities,
which might not be worth the effort since forboth/forthree are used
much less often than foreach. In Citus we do have 3 forboth macros
that don't require ListCell for the most common cases (foreach_ptr,
foreach_ptr_oid, foreach_int_oid). But I did not want to clutter this
patchset with that discussion.


v1-0001-Add-macros-for-looping-through-a-list-without-nee.patch
Description: Binary data


v1-0002-Use-new-for_each_xyz-macros-in-a-few-places.patch
Description: Binary data


Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Tom Lane
Dean Rasheed  writes:
> On Tue, 24 Oct 2023 at 09:36, Laurenz Albe  wrote:
>> I'd say that this error is wrong.  The FOR SELECT policy should be applied
>> to the WHERE condition, but certainly not to check new rows.

> Yes, I had the same thought recently. I would say that the SELECT
> policies should only be used to check new rows if the UPDATE has a
> RETURNING clause and SELECT permissions are required on the target
> relation.

> In other words, it should be OK to UPDATE a row to new values that are
> not visible according to the table's SELECT policies, provided that
> the UPDATE command does not attempt to return those new values. That
> would be consistent with what we do for INSERT.

> Note, that the current behaviour goes back a long way, though it's not
> quite clear whether this was intentional [1].

I'm fairly sure that it was intentional, but I don't recall the
reasoning; perhaps Stephen does.  In any case, I grasp your point
that maybe we should distinguish RETURNING from not-RETURNING cases.

regards, tom lane




Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-10-24 Thread Tom Lane
Christoph Berg  writes:
> Anyway, if this doesn't raise any "oh we didn't think of this"
> concerns, we'll just remove the old operators in pgsphere.

Well, the idea was exactly to forbid that sort of setup.
However, if we get sufficient pushback maybe we should
reconsider --- for example, maybe it'd be sane to enforce
the restriction in ALTER but not CREATE?

I'm inclined to wait and see if there are more complaints.

regards, tom lane




Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-10-24 Thread Christoph Berg
Re: Tom Lane
> > We might be able to simply delete the @ operators, but doesn't this
> > new check break the general possibility to have more than one spelling
> > for the same operator?
> 
> You can have more than one operator atop the same function.
> But why didn't you make the @ operators commutators of each other,
> rather than this mess?

Historical something.

You are right that the commutators could be fixed that way, but the
negators are a different question. There is no legacy spelling for
these.

Anyway, if this doesn't raise any "oh we didn't think of this"
concerns, we'll just remove the old operators in pgsphere.

Christoph




RE: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-10-24 Thread José Neves
Hi there, I hope this finds you well.

I have a follow-up question to this already long thread.

Upon deploying my PostgreSQL logical-replication-fed application on a stale 
(mostly idle) database, I ended up running out of space, as the replication 
slot is held back until the next time we receive a data-changing event and 
advance to that new LSN offset.
I think the solution for this is to advance our LSN offset every time a 
keep-alive message is received ('k' // 107).
My doubt is: can keep-alive messages be received in between the events of an 
open transaction? I think not, but I would like your input to be extra sure, 
because if that happens and I commit that offset, I may again introduce 
faulty logic leading to data loss.

In sum, something like this wouldn't happen:
BEGIN LSN001
INSERT LSN002
KEEP LIVE LSN003
UPDATE LSN004
COMMIT LSN005

Correct? It has to be either:

KEEP LIVE LSN001
BEGIN LSN002
INSERT LSN003
UPDATE LSN004
COMMIT LSN005

Or:
BEGIN LSN001
INSERT LSN002
UPDATE LSN004
COMMIT LSN005
KEEP LIVE LSN006

LSNXXX are mere representations of LSN offsets.
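
As an illustration of what I mean (sketch only; read_be64() and
send_standby_status() are stand-in helpers), keep-alive handling that refuses
to advance the offset while a transaction is being assembled would look like:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ReplState
    {
        bool        in_transaction; /* saw BEGIN, no COMMIT yet */
        uint64_t    confirmed_lsn;  /* last offset safe to report flushed */
    } ReplState;

    /* stand-in: send a Standby Status Update ('r') with the flushed LSN */
    extern void send_standby_status(uint64_t flushed_lsn, bool reply_requested);

    static uint64_t
    read_be64(const uint8_t *p)
    {
        uint64_t    v = 0;

        for (int i = 0; i < 8; i++)
            v = (v << 8) | p[i];
        return v;
    }

    /* msg: 'k', int64 walEnd, int64 sendTime, byte replyRequested */
    static void
    handle_keepalive(ReplState *state, const uint8_t *msg)
    {
        uint64_t    wal_end = read_be64(msg + 1);
        bool        reply_requested = msg[17] != 0;

        /* only advance while no transaction is in flight, to be safe */
        if (!state->in_transaction && wal_end > state->confirmed_lsn)
            state->confirmed_lsn = wal_end;

        if (reply_requested)
            send_standby_status(state->confirmed_lsn, true);
    }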

Thank you again.
Regards,
José Neves


From: Amit Kapila 
Sent: 8 August 2023 14:37
To: José Neves 
Cc: Andres Freund ; pgsql-hack...@postgresql.org 

Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom 
client

On Mon, Aug 7, 2023 at 1:46 PM José Neves  wrote:
>
> Humm, that's... challenging. I faced some issues after "the fix" because I 
> had a couple of transactions with 25k updates, and I had to split it to be 
> able to push to our event messaging system, as our max message size is 10MB. 
> Relying on commit time would mean that all transaction operations will have 
> the same timestamp. If something goes wrong while my worker is pushing that 
> transaction data chunks, I will duplicate some data in the next run, so... 
> this wouldn't allow me to deal with data duplication.
> Is there any other way that you see to deal with it?
>
> Right now I only see an option, which is to store all processed LSNs on the 
> other side of the ETL. I'm trying to avoid that overhead.
>

Sorry, I don't understand your system enough to give you suggestions
but if you have any questions related to how logical replication work
then I might be able to help.

--
With Regards,
Amit Kapila.


Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-10-24 Thread Tom Lane
Christoph Berg  writes:
> This change is breaking pgsphere which has <@ @> operator pairs, but
> for historical reasons also includes alternative spellings of these
> operators (both called @ with swapped operand types) which now
> explodes because we can't add them with the "proper" commutator and
> negators declared (which point to the canonical <@ @> !<@ !@>
> operators).

Should have guessed that somebody might be depending on the previous
squishy behavior.  Still, I can't see how the above situation is a
good idea.  Commutators/negators should come in pairs, not have
completely random links.  I think it's only accidental that this
setup isn't triggering other strange behavior.

> We might be able to simply delete the @ operators, but doesn't this
> new check break the general possibility to have more than one spelling
> for the same operator?

You can have more than one operator atop the same function.
But why didn't you make the @ operators commutators of each other,
rather than this mess?

regards, tom lane




Re: Replace references to malloc() in libpq documentation with generic language

2023-10-24 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 24 Oct 2023, at 07:13, Gurjeet Singh  wrote:
>> The user does not benefit from knowing that libpq allocates some/all memory
>> using malloc(). Mentioning malloc() here has a few downsides, and almost no
>> benefits.

> I'm not entirely convinced that replacing "malloc" with "allocated on the 
> heap"
> improves the documentation.

That was my reaction too.  The underlying storage allocator *is* malloc,
and C programmers know what that is, and I don't see how obfuscating
that improves matters.  It's true that on the miserable excuse for a
platform that is Windows, you have to use PQfreemem because of
Microsoft's inability to supply a standards-compliant implementation
of malloc.  But I'm not inclined to let that tail wag the dog.

> I do agree with this proposed change though:

> -  all the space that will be freed by .
> +  all the memory that will be freed by .

+1, seems harmless.

regards, tom lane




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-24 Thread Peter Eisentraut

On 24.10.23 14:40, Julien Rouhaud wrote:

On Tue, Oct 24, 2023 at 6:57 PM Peter Eisentraut  wrote:


On 24.10.23 09:58, Andrei Zubkov wrote:

During last moving to the current commitfest this patch have lost its
reviewers list. With respect to reviewers contribution in this patch, I
think reviewers list should be fixed.


I don't think it's the purpose of the commitfest app to track who *has*
reviewed a patch.  The purpose is to plan and allocate *current* work.
If we keep a bunch of reviewers listed on a patch who are not actually
reviewing the patch, then that effectively blocks new reviewers from
signing up and the patch will not make progress.

Past reviewers should of course be listed in the commit message, the
release notes, etc. as appropriate.


Really?  Last time this topic showed up at least one committer said
that they tend to believe the CF app more than digging the thread [1],
and some other hackers mentioned other usage for being kept in the
reviewer list.  Since no progress has been made on the CF app since then,
I'm not sure it's the best idea to drop reviewer names from patch
entries generally.


There is a conflict between the two purposes.  But it is clearly the 
case that reviewers will more likely pick up patches that have no 
reviewers assigned.  So if you keep stale reviewer entries around, then 
a patch that stays around for a while will never get reviewed again.  I 
think this is a significant problem at the moment, and I made it part of 
my mission during the last commitfest to clean it up.  If people want to 
put the stale reviewer entries back in, that is possible, but I would 
caution against that, because that would just self-sabotage those patches.






Re: trying again to get incremental backup

2023-10-24 Thread Peter Eisentraut

On 04.10.23 22:08, Robert Haas wrote:

- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful.


The easiest answer is to have it off by default.  Let people figure out 
what works for them.  There are various factors like storage, network, 
server performance, RTO that will determine what combination of full 
backup, incremental backup, and WAL replay will satisfy someone's 
requirements.  I suppose tests could be set up to determine this to some 
degree.  But we could also start slow and let people figure it out 
themselves.  When pg_basebackup was added, it was also disabled by default.


If we think that 7d is a good setting, then I would suggest to consider, 
like 10d.  Otherwise, if you do a weekly incremental backup and you have 
a time change or a hiccup of some kind one day, you lose your backup 
sequence.


Another possible answer is, like, 400 days?  Because why not?  What is a 
reasonable upper limit for this?



- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker,


It looks like each file entry in the manifest takes about 150 bytes, so 
1 GB would allow for 1024**3/150 = 7158278 files.  That seems fine for now?



- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?


The current user experience of pg_basebackup is that it waits possibly a 
long time for a checkpoint, and there is an option to make it go faster, 
but there is no timeout AFAICT.  Is this substantially different?  Could 
we just let it wait forever?


Also, does waiting for checkpoint and WAL summarization happen in 
parallel?  If so, what if it starts a checkpoint that might take 15 min 
to complete, and then after 60 seconds it kicks you off because the WAL 
summarization isn't ready.  That might be wasteful.



- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche.


This seems fine.

Is the WAL summary file format documented anywhere in your patch set 
yet?  My only thought was, maybe the file format could be human-readable 
(more like backup_label) to avoid this.  But maybe not.






Should Explain show Parallel Hash node’s total rows?

2023-10-24 Thread Zhang Mingli

Hi, all

Shall we show Parallel Hash node’s total rows of a Parallel-aware HashJoin?

Ex: a non-parallel plan, table simple has 20000 rows.

zml=# explain select count(*) from simple r join simple s using (id);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=1309.00..1309.01 rows=1 width=8)
   ->  Hash Join  (cost=617.00..1259.00 rows=20000 width=0)
         Hash Cond: (r.id = s.id)
         ->  Seq Scan on simple r  (cost=0.00..367.00 rows=20000 width=4)
         ->  Hash  (cost=367.00..367.00 rows=20000 width=4)
               ->  Seq Scan on simple s  (cost=0.00..367.00 rows=20000 width=4)
(6 rows)

While a parallel-aware plan:

zml=# explain select count(*) from simple r join simple s using (id);
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=691.85..691.86 rows=1 width=8)
   ->  Gather  (cost=691.63..691.84 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=691.63..691.64 rows=1 width=8)
               ->  Parallel Hash Join  (cost=354.50..670.80 rows=8333 width=0)
                     Hash Cond: (r.id = s.id)
                     ->  Parallel Seq Scan on simple r  (cost=0.00..250.33 rows=8333 width=4)
                     ->  Parallel Hash  (cost=250.33..250.33 rows=8333 width=4)
                           ->  Parallel Seq Scan on simple s  (cost=0.00..250.33 rows=8333 width=4)
(9 rows)

In initial_cost_hashjoin(), we undo the parallel division when the path is
parallel-aware. That's reasonable because a shared hash table should have all
the data. And we also take parallelism into account for the Hash plan's total
rows if it's parallel-aware:
```
if (best_path->jpath.path.parallel_aware)
{
    hash_plan->plan.parallel_aware = true;
    hash_plan->rows_total = best_path->inner_rows_total;
}
```

But the Parallel Hash node of plan shows the same rows with subplan, I’m 
wandering if it’s more reasonable to show rows_total instead of plan_rows for 
Parallel Hash nodes?

For this example,
  -> Parallel Hash (rows=20000)
    -> Parallel Seq Scan on simple s (rows=8333)
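
A minimal sketch of what that could look like on the EXPLAIN side (the fields
exist in plannodes.h, but the placement in ExplainNode() is only illustrative,
not a worked-out patch):

```
	/* Sketch only: prefer the shared hash table's total row estimate
	 * when printing a parallel-aware Hash node, instead of the
	 * per-worker plan_rows value. */
	double		rows = plan->plan_rows;

	if (IsA(plan, Hash) && plan->parallel_aware)
		rows = ((Hash *) plan)->rows_total;
```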



Zhang Mingli
HashData https://www.hashdata.xyz



Re: run pgindent on a regular basis / scripted manner

2023-10-24 Thread Andrew Dunstan



On 2023-10-18 We 05:07, Jelte Fennema wrote:

I think --enable-indent-checks sounds like a good improvement to the
status quo. But I'm not confident that it will help remove the cases
where only a comment needs to be re-indented. Do committers really
always run check-world again when only changing a typo in a comment? I
know I probably wouldn't (or at least not always).



Yeah. In fact I'm betting that a lot of the offending commits we've seen 
come into this category. You build, you check, then you do some final 
polish. That's where a pre-commit hook can save you.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges

2023-10-24 Thread Christoph Berg
Re: Tommy Pavlicek
> I've added another patch (0002-require_unused_neg_com-v1.patch) that
> prevents using a commutator or negator that's already part of a pair.

Hmm. I agree with the general idea of adding sanity checks, but this
might be overzealous:

This change is breaking pgsphere which has <@ @> operator pairs, but
for historical reasons also includes alternative spellings of these
operators (both called @ with swapped operand types) which now
explodes because we can't add them with the "proper" commutator and
negators declared (which point to the canonical <@ @> !<@ !@>
operators).

https://github.com/postgrespro/pgsphere/blob/master/pgs_moc_compat.sql.in

We might be able to simply delete the @ operators, but doesn't this
new check break the general possibility to have more than one spelling
for the same operator?

Christoph




Re: run pgindent on a regular basis / scripted manner

2023-10-24 Thread Bruce Momjian
On Tue, Oct 24, 2023 at 09:40:01AM -0400, Andrew Dunstan wrote:
> Slightly off topic, but apropos your message, maybe we should recommend a
> standard git commit template.

I use this and then automatically remove any sections that are empty.

---


|--- gitweb subject length limit ---|--- email limit ---|


Reported-by:

Diagnosed-by:

Bug:

Discussion:

Author:

Reviewed-by:

Tested-by:

Backpatch-through:

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: run pgindent on a regular basis / scripted manner

2023-10-24 Thread Andrew Dunstan



On 2023-10-17 Tu 09:52, Robert Haas wrote:

On Tue, Oct 17, 2023 at 6:34 AM Jelte Fennema  wrote:

I think *it is* dead easy to comply. If you run the following commands
before committing/after rebasing, then koel should always be happy:

src/tools/pgindent/pgindent src # works always but a bit slow
src/tools/pgindent/pgindent $(git diff --name-only --diff-filter=ACMR)
# much faster, but only works if you DID NOT change typedefs.list

In isolation, that's true, but the list of mistakes that you can make
while committing which will inconvenience everyone working on the
project is very long. Another one that comes up frequently is
forgetting to bump CATALOG_VERSION_NO, but you also need a good commit
message, and good comments, and a good Discussion link in the commit
message, and the right list of authors and reviewers, and to update
the docs (with spaces, not tabs) and the Makefiles (with tabs, not
spaces) and the meson stuff and, as if that weren't enough already,
you actually need the code to work! And that includes not only working
regularly but also with CLOBBER_CACHE_ALWAYS and debug_parallel_query
and so on. It's very easy to miss something somewhere. I put a LOT of
work into polishing my commits before I push them, and it's still not
that uncommon that I screw something up.



Yes, there's a lot to look out for, and you're a damn sight better at it 
than I am. But we should try to automate the things that can be 
automated, even if that leaves many tasks that can't be. I have three 
things in my pre-commit hook: a check for catalog updates, a check for 
new typedefs, and an indent check. And every one of them has saved me 
from doing things I should not be doing. They aren't perfect but they 
are useful.


Slightly off topic, but apropos your message, maybe we should recommend 
a standard git commit template.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: More new SQL/JSON item methods

2023-10-24 Thread Andrew Dunstan


On 2023-10-19 Th 02:06, Jeevan Chalke wrote:

Thanks, Peter for the comments.

On Fri, Oct 6, 2023 at 5:13 PM Peter Eisentraut  
wrote:


On 29.08.23 09:05, Jeevan Chalke wrote:
> v1-0001-Implement-jsonpath-.bigint-.integer-and-.number-m.patch
>
> This commit implements jsonpath .bigint(), .integer(), and .number()
> methods.  The JSON string or a numeric value is converted to the
> bigint, int4, and numeric type representation.

A comment that applies to all of these: These add various keywords,
switch cases, documentation entries in some order.  Are we happy with
that?  Should we try to reorder all of that for better
maintainability
or readability?


Yeah, that's the better suggestion. While implementing these methods, 
I was confused about where to put them exactly and tried keeping them 
in some logical place.
I think once these methods get in, we can have a follow-up patch 
reorganizing all of these.



I think it would be better to organize things how we want them before 
adding in more stuff.



cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Add trailing commas to enum definitions

2023-10-24 Thread Andrew Dunstan



On 2023-10-23 Mo 17:04, Tom Lane wrote:

Nathan Bossart  writes:

 From a long-term perspective, I
think standardizing on the trailing comma style will actually improve
git-blame because patches won't need to add a comma to the previous line
when adding a value.

Yeah, that's a good point.  I had been leaning towards "this is
unnecessary churn", but with that idea I'm now +1.





+1. It's a fairly common practice in Perl code, too, and I often do it 
for exactly this reason.
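
For illustration, a hypothetical C enum in this style (names invented):

```
typedef enum WidgetState
{
	WIDGET_NEW,
	WIDGET_ACTIVE,
	WIDGET_RETIRED,		/* trailing comma: a later patch that appends a
						 * value won't have to modify this line */
} WidgetState;
```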



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Various bugs if segment containing redo pointer does not exist

2023-10-24 Thread Robert Haas
On Mon, Oct 23, 2023 at 8:43 PM Andres Freund  wrote:
> The source of the emode=13=DEBUG2 is that that's hardcoded in
> WaitForWALToBecomeAvailable(). I guess the error ought to come from
> XLogPageRead(), but all that happens is this:
>
> case XLREAD_FAIL:
> if (readFile >= 0)
> close(readFile);
> readFile = -1;
> readLen = 0;
> readSource = XLOG_FROM_ANY;
> return XLREAD_FAIL;

I've been under the impression that the guiding principle here is that
we shouldn't error out upon hitting a condition that should only cause
us to switch sources. I think WaitForWALToBecomeAvailable() is
supposed to set things up so that XLogPageRead()'s call to pg_pread()
will succeed. If it says it can't, then XLogPageRead() is only obliged
to pass that information up to the caller, who can decide to wait
longer for the data to show up, or give up, or whatever it wants to
do. On the other hand, if WaitForWALToBecomeAvailable() says that it's
fine to go ahead and call pg_pread() and pg_pread() then fails, then
that means that we've got a problem with the WAL file other than it
just not being available yet, like it's the wrong length or there was
an I/O error, and those are reportable errors. Said differently, in
the former case, the WAL is not broken, merely not currently
available; in the latter case, it's broken.
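
To make the distinction concrete, here's a simplified sketch of the reportable
case (loosely modeled on XLogPageRead(), not the exact code):

```
	/* WaitForWALToBecomeAvailable() promised the page is readable, so
	 * a short read here means the WAL is broken, and we report it. */
	int			r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

	if (r != XLOG_BLCKSZ)
		ereport(emode_for_corrupt_record(emode, targetPagePtr),
				(errcode_for_file_access(),
				 errmsg("could not read from WAL segment, offset %u: read %d of %d",
						readOff, r, XLOG_BLCKSZ)));
```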

The legibility and maintainability of this code are certainly not
great. The xlogreader mechanism is extremely useful, but maybe we
should have done more cleanup of the underlying mechanism first. It's
a giant ball of spaghetti code that is challenging to understand and
almost impossible to validate thoroughly (as you just discovered).

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Query execution in Perl TAP tests needs work

2023-10-24 Thread Andrew Dunstan


On 2023-10-18 We 11:47, Tom Lane wrote:

Robert Haas  writes:

On Wed, Oct 18, 2023 at 10:28 AM Tom Lane  wrote:

I did a bit of research on this on my favorite platforms, and did
not like the results:

Hmm. That's unfortunate. Is perl -MCPAN -e 'install Platypus::FFI' a
viable alternative?

Probably, see my followup.





Interesting. OK, here's an attempt to push the cart a bit further down 
the road. The attached module wraps quite a lot of libpq, at least 
enough for most of the cases we would be interested in, I think. It also 
exports some constants such as connection status values, query status 
values, transaction status values and type Oids. It also makes the 
absence of FFI::Platypus not instantly fatal, but any attempt to use one 
of the wrapped functions will die with a message about the module being 
missing if it's not found.


I guess the next step would be for someone to locate some of the 
hotspots in the TAP tests and try to convert them to using persistent 
connections with this gadget or similar and see how much faster we can 
make them.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


PqFFI.pm
Description: Perl program


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-24 Thread Julien Rouhaud
On Tue, Oct 24, 2023 at 6:57 PM Peter Eisentraut  wrote:
>
> On 24.10.23 09:58, Andrei Zubkov wrote:
> > During the last move to the current commitfest, this patch lost its
> > reviewers list. With respect to the reviewers' contribution to this patch,
> > I think the reviewers list should be restored.
>
> I don't think it's the purpose of the commitfest app to track who *has*
> reviewed a patch.  The purpose is to plan and allocate *current* work.
> If we keep a bunch of reviewers listed on a patch who are not actually
> reviewing the patch, then that effectively blocks new reviewers from
> signing up and the patch will not make progress.
>
> Past reviewers should of course be listed in the commit message, the
> release notes, etc. as appropriate.

Really?  Last time this topic showed up at least one committer said
that they tend to believe the CF app more than digging the thread [1],
and some other hackers mentioned other uses for being kept in the
reviewer list.  Since no progress has been made on the CF app since then,
I'm not sure it's the best idea to drop reviewer names from patch
entries generally.

[1] https://www.postgresql.org/message-id/552155.1648737...@sss.pgh.pa.us




make pg_ctl start more friendly

2023-10-24 Thread Crisp Lee
I tried PITR recovery with the 'recovery_target_action' GUC set to shutdown. I
got a failure, and pg_ctl told me to check the log; only there did I find that
the result was due to that GUC. I think pg_ctl should print some information
telling users that recovery has completed.
I developed a commit in my workspace. The steps below:
1. the postmaster exits with code 3 if startup shuts down because of the
recovery target action
2. add enum POSTMASTER_RECOVERY_SHUTDOWN in pg_ctl
3. print information to stderr if the postmaster's exit code is 3
I tested it, and it works.
I think this information is very useful, especially for beginners. A good
project needs not only performance, but also ease of use.
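
A rough sketch of step 3 (the exit-code value, names, and message wording are
just the proposal above, not committed behavior):

```
/* Hypothetical: in pg_ctl, after detecting that the postmaster exited
 * during startup, map exit code 3 to a friendlier explanation. */
#define POSTMASTER_RECOVERY_SHUTDOWN 3

	if (WIFEXITED(exitstatus) &&
		WEXITSTATUS(exitstatus) == POSTMASTER_RECOVERY_SHUTDOWN)
		write_stderr(_("%s: server shut itself down because recovery_target_action is \"shutdown\"\n"),
					 progname);
```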


Re: RFC: Logging plan of the running query

2023-10-24 Thread torikoshia

On 2023-10-24 16:12, Étienne BERSAC wrote:

Hi,

Yeah, and when we have a situation where we want to run
pg_log_query_plan(), we can run it in any environment as long as it
is bundled with the core.


Is it possible to get the plan of a backend programmatically without
access to the logs?

Something like pg_get_query_plan(pid, format) which output would be the
same as EXPLAIN.


Do you imagine a function like below?

  =# select pg_get_query_plan(100, 'plain');
QUERY PLAN
  ---
   Limit  (cost=0.00..0.04 rows=1 width=273)
 ->  Seq Scan on pg_class  (cost=0.00..17.14 rows=414 width=273)

If so, we once tried to implement such a function for getting memory
contexts.
However, that attempt didn't succeed because it could lead to a deadlock
situation [1].


I think the same problem can occur when implementing 
pg_get_query_plan().


[1] 
https://www.postgresql.org/message-id/9a50371e15e741e295accabc72a41df1%40oss.nttdata.com


Regards,
Étienne BERSAC
Dalibo


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: trying again to get incremental backup

2023-10-24 Thread Robert Haas
On Mon, Oct 23, 2023 at 7:56 PM David Steele  wrote:
> > I also think a lot of the use of the low-level API is
> > driven by it being just too darn slow to copy the whole database, and
> > incremental backup can help with that in some circumstances.
>
> I would argue that restore performance is *more* important than backup
> performance and this patch is a step backward in that regard. Backups
> will be faster and less space will be used in the repository, but
> restore performance is going to suffer. If the deltas are very small the
> difference will probably be negligible, but as the deltas get large (and
> especially if there are a lot of them) the penalty will be more noticeable.

I think an awful lot depends here on whether the repository is local
or remote. If you have filesystem access to wherever the backups are
stored anyway, I don't think that using pg_combinebackup to write out
a new data directory is going to be much slower than copying one data
directory from the repository to wherever you'd actually use the
backup. It may be somewhat slower because we do need to access some
data in every involved backup, but I don't think it should be vastly
slower because we don't have to read every backup in its entirety. For
each file, we read the (small) header of the newest incremental file
and every incremental file that precedes it until we find a full file.
Then, we construct a map of which blocks need to be read from which
sources and read only the required blocks from each source. If all the
blocks are coming from a single file (because there are no incrementals
for a certain file, or they contain no blocks) then we just copy the
entire source file in one shot, which can be optimized using the same
tricks we use elsewhere. Inevitably, this is going to read more data
and do more random I/O than just a flat copy of a directory, but it's
not terrible. The overall amount of I/O should be a lot closer to the
size of the output directory than to the sum of the sizes of the input
directories.
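
To sketch that per-file loop (every identifier below is invented for
illustration; pg_combinebackup's real code differs):

```
	/* Walk from the newest incremental back to a full copy, reading
	 * only the small per-file headers along the way. */
	int			n = 0;

	while (!sources[n].is_full_file)
		read_incremental_header(&sources[n++]);

	/* Then pull each block from the newest source that contains it. */
	for (BlockNumber blk = 0; blk < nblocks; blk++)
		copy_block(newest_source_with_block(sources, n, blk), blk, outfd);
```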

Now, if the repository is remote, and you have to download all of
those backups first, and then run pg_combinebackup on them afterward,
that is going to be unpleasant, unless the incremental backups are all
quite small. Possibly this could be addressed by teaching
pg_combinebackup to do things like accessing data over HTTP and SSH,
and relatedly, looking inside tarfiles without needing them unpacked.
For now, I've left those as ideas for future improvement, but I think
potentially they could address some of your concerns here. A
difficulty is that there are a lot of protocols that people might want
to use to push bytes around, and it might be hard to keep up with the
march of progress.

I do agree, though, that there's no such thing as a free lunch. I
wouldn't recommend to anyone that they plan to restore from a chain of
100 incremental backups. Not only might it be slow, but the
opportunities for something to go wrong are magnified. Even if you've
automated everything well enough that there's no human error involved,
what if you've got a corrupted file somewhere? Maybe that's not likely
in absolute terms, but the more files you've got, the more likely it
becomes. What I'd suggest someone do instead is periodically do
pg_combinebackup full_reference_backup oldest_incremental -o
new_full_reference_backup; rm -rf full_reference_backup; mv
new_full_reference_backup full_reference_backup. The new full
reference backup is intended to still be usable for restoring
incrementals based on the incremental it replaced. I hope that, if
people use the feature well, this should limit the need for really
long backup chains. I am sure, though, that some people will use it
poorly. Maybe there's room for more documentation on this topic.

> I was concerned with the difficulty of trying to stage the correct
> backups for pg_combinebackup, not whether it would recognize that the
> needed data was not available and then error appropriately. The latter
> is surmountable within pg_combinebackup but the former is left up to the
> user.

Indeed.

> One note regarding the patches. I feel like
> v5-0005-Prototype-patch-for-incremental-backup should be split to have
> the WAL summarizer as one patch and the changes to base backup as a
> separate patch.
>
> It might not be useful to commit one without the other, but it would
> make for an easier read. Just my 2c.

Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-24 Thread Peter Eisentraut

On 24.10.23 09:58, Andrei Zubkov wrote:

During the last move to the current commitfest, this patch lost its
reviewers list. With respect to the reviewers' contribution to this patch, I
think the reviewers list should be restored.


I don't think it's the purpose of the commitfest app to track who *has*
reviewed a patch.  The purpose is to plan and allocate *current* work. 
If we keep a bunch of reviewers listed on a patch who are not actually 
reviewing the patch, then that effectively blocks new reviewers from 
signing up and the patch will not make progress.


Past reviewers should of course be listed in the commit message, the 
release notes, etc. as appropriate.






Open a streamed block for transactional messages during decoding

2023-10-24 Thread Zhijie Hou (Fujitsu)
Hi,

While reviewing the test_decoding code, I noticed that when the skip_empty_xacts
option is specified, it doesn't open the streaming block (e.g.
pg_output_stream_start()) before streaming the transactional MESSAGE, even if
it's the first change in the streaming block.

This is inconsistent with what we do when streaming DML
changes (e.g. pg_decode_stream_change()).

Here is a small patch to open the stream block in this case.
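
The shape of the fix is roughly this, mirroring pg_decode_stream_change() in
test_decoding.c (a sketch; see the attached patch for the real change):

```
	/* Open the streaming block before emitting the first transactional
	 * message, just as the stream_change callback does. */
	if (data->skip_empty_xacts && !txndata->stream_wrote_changes)
		pg_output_stream_start(ctx, data, txn, false);
	txndata->stream_wrote_changes = true;
```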

Best Regards,
Hou Zhijie



0001-Open-a-streamed-block-for-transactional-messages-dur.patch
Description:  0001-Open-a-streamed-block-for-transactional-messages-dur.patch


d844cd75a and postgres_fdw

2023-10-24 Thread Devrim Gündüz

Hi,

I'm seeing an issue after upgrading from 12.13 to 15.4. This happens
when we run a query against a foreign table (fdw on the same instance to
a different database) -- but does not appear when we get rid of
postgres_fdw:

ERROR:  cursor can only scan forward
HINT:  Declare it with SCROLL option to enable backward scan.
CONTEXT:  remote SQL command: MOVE BACKWARD ALL IN c1

SQL state: 55000

I attached the query. The name of the foreign table is
"foobar.sys_user".

This looks like bug #17889, and this is the last email in that thread:
https://www.postgresql.org/message-id/1852635.1682808624%40sss.pgh.pa.us

OTOH, same query works (against the FDW) when we remove the following
WHERE clause:

WHERE
tbl.table_status = 'A'
AND tbl.table_id <> 1
AND tbl.table_id <> - 2

Any hints?

Regards,

-- 
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR
SELECT
*
FROM
foobar.sys_data_table_info AS tbl
INNER JOIN foobar.sys_folder_items fi ON tbl.table_id = fi.item_id
AND fi.item_type = 'TABLE'
LEFT JOIN foobar.sys_user_foreign uc ON tbl.created_by = uc.user_id
LEFT JOIN foobar.sys_user_foreign uu ON tbl.updated_by = uu.user_id
WHERE
tbl.table_status = 'A'
AND tbl.table_id <> 1
AND tbl.table_id <> - 2
ORDER BY
updated_at DESC NULLS LAST


RE: logical decoding and replication of sequences, take 2

2023-10-24 Thread Zhijie Hou (Fujitsu)
On Thursday, October 12, 2023 11:06 PM Tomas Vondra 
 wrote:
>

Hi,

I have been reviewing the patch set, and here are some initial comments.

1.

I think we need to mark the RBTXN_HAS_STREAMABLE_CHANGE flag for transactional
sequence change in ReorderBufferQueueChange().

2.

ReorderBufferSequenceIsTransactional

It seems we call the above function once in sequence_decode() and call it again
in ReorderBufferQueueSequence(); would it be better to avoid the second call,
as the hashtable search doesn't look cheap?

3.

The patch cleans up the sequence hash table when a transaction COMMITs or
ABORTs (via ReorderBufferAbort() and ReorderBufferReturnTXN()), while it
doesn't seem to destroy the hash table when the transaction is PREPAREd. It's
not a big problem, but would it be better to release the memory earlier by
destroying the table on prepare?

4.

+pg_decode_stream_sequence(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
...
+   /* output BEGIN if we haven't yet, but only for the transactional case 
*/
+   if (transactional)
+   {
+   if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
+   {
+   pg_output_begin(ctx, data, txn, false);
+   }
+   txndata->xact_wrote_changes = true;
+   }

I think we should call pg_output_stream_start() instead of pg_output_begin()
for streaming sequence changes.

5.
+   /*
+* Schema should be sent using the original relation because it
+* also sends the ancestor's relation.
+*/
+   maybe_send_schema(ctx, txn, relation, relentry);

The comment seems a bit misleading here, I think it was used for the partition
logic in pgoutput_change().

Best Regards,
Hou zj


A case for GIST supporting ORDER BY

2023-10-24 Thread Michał Kłeczek
Hi,

Some time ago I provided some details about the issues we face when trying to
use GIST and partitioning at the same time, in the postgresql-general mailing
list:
https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org

We decided to go with the solution to partition our table by:

RANGE ('2100-01-01' <-> operation_date).

While it (somewhat) solves the partition pruning issues described above, there
is another problem:
It is impossible to create a unique constraint on the partitioned table.

So now we cannot use INSERT … ON CONFLICT (…) DO UPDATE



My question to hackers:
Would it be feasible to implement ORDER BY column GIST index (only) scan for 
types with total order and sensible greatest and least values?

Thanks,
Michal


Re: Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Dean Rasheed
On Tue, 24 Oct 2023 at 09:36, Laurenz Albe  wrote:
>
> I'd say that this error is wrong.  The FOR SELECT policy should be applied
> to the WHERE condition, but certainly not to check new rows.
>

Yes, I had the same thought recently. I would say that the SELECT
policies should only be used to check new rows if the UPDATE has a
RETURNING clause and SELECT permissions are required on the target
relation.

In other words, it should be OK to UPDATE a row to new values that are
not visible according to the table's SELECT policies, provided that
the UPDATE command does not attempt to return those new values. That
would be consistent with what we do for INSERT.

Note, that the current behaviour goes back a long way, though it's not
quite clear whether this was intentional [1].

[1] 
https://github.com/postgres/postgres/commit/7d8db3e8f37aec9d252353904e77381a18a2fa9f

Regards,
Dean




Re: Synchronizing slots from primary to standby

2023-10-24 Thread Drouvot, Bertrand

Hi,

On 10/24/23 7:44 AM, Ajin Cherian wrote:

On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
 wrote:


@@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
  SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
  }

+   /* set failover in the slot, as requested */
+   slot->data.failover = ctx->failover;
+

I think we can get rid of this change in CreateDecodingContext().


Yes, I too noticed this in my testing, however just removing this from
CreateDecodingContext will not allow us to change the slot's failover flag
using Alter subscription.


Oh right.


I am thinking of moving this change to
StartLogicalReplication prior to calling CreateDecodingContext by
parsing the command options in StartReplicationCmd
without adding it to the LogicalDecodingContext.



Yeah, that looks like a good place to update "failover".

I did some more testing, and I have a couple of remarks about the current behavior.

1) Let's imagine that:

- there is no standby
- standby_slot_names is set to a valid slot on the primary (but due to the 
above, not linked to any standby)
- then a create subscription on a subscriber WITH (failover = true) would start
the synchronisation but never finish (meaning it leaves a "synchronisation"
slot like "pg_32811_sync_24576_7293415241672430356" in place, coming from
ReplicationSlotNameForTablesync()).

That's expected, but maybe we should emit a warning in
WalSndWaitForStandbyConfirmation() on the primary when a slot listed in
standby_slot_names is not active/does not have an
active_pid attached to it?
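
Something like this sketch in WalSndWaitForStandbyConfirmation() may be enough
(purely illustrative, message wording included):

```
	/* Warn about listed slots that nobody is consuming, so the wait
	 * does not look like a silent hang. */
	bool		inactive;

	SpinLockAcquire(&slot->mutex);
	inactive = (slot->active_pid == 0);
	SpinLockRelease(&slot->mutex);

	if (inactive)
		ereport(WARNING,
				errmsg("slot \"%s\" in standby_slot_names has no active consumer",
					   NameStr(slot->data.name)));
```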

2) When we create a subscription, another slot is created during the 
subscription synchronization, namely
like "pg_16397_sync_16388_7293447291374081805" (coming from 
ReplicationSlotNameForTablesync()).

This extra slot appears to have failover also set to true.

So, if the standby refreshes the list of slots to sync while the subscription
is still synchronizing, we'd see things like this
on the standby:

LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) 
to pass local slot LSN (0/C0034840) and and catalog xmin (756)
LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog 
xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN 
(0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog 
xmin (756)
WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the 
primary, aborting slot creation

I'm not sure this "pg_16397_sync_16388_7293447291374081805" slot should have
failover set to true. If there is a failover
during the subscription creation, wouldn't it be better to re-launch the
subscription instead?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Simplify create_merge_append_path a bit for clarity

2023-10-24 Thread Alena Rybakina

Hi!

On 11.08.2023 05:31, Richard Guo wrote:

As explained in the comments for generate_orderedappend_paths, we don't
currently support parameterized MergeAppend paths, and it doesn't seem
like going to change anytime soon.  Based on that,  we could simplify
create_merge_append_path a bit, such as set param_info to NULL directly
rather than call get_appendrel_parampathinfo() for it.  We already have
an Assert on that in create_merge_append_plan.

I understand that the change would not make any difference for
performance, it's just for clarity's sake.


I agree with you: we can indeed set the param_info value directly to
NULL, and there are enough comments here to explain why.


I didn't find anything else to comment on in your patch.
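
If I read the proposal correctly, the simplification boils down to one line in
create_merge_append_path() (a sketch, not the actual diff):

```
	/* Parameterized MergeAppend paths are unsupported, so set this
	 * directly instead of calling get_appendrel_parampathinfo(). */
	pathnode->path.param_info = NULL;
```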

--
Regards,
Alena Rybakina





Bug: RLS policy FOR SELECT is used to check new rows

2023-10-24 Thread Laurenz Albe
Try this as a user with NOBYPASSRLS:


CREATE TABLE rlsbug (deleted boolean);

INSERT INTO rlsbug VALUES (FALSE);

CREATE POLICY p_sel ON rlsbug FOR SELECT TO laurenz USING (NOT deleted);

CREATE POLICY p_upd ON rlsbug FOR UPDATE TO laurenz USING (TRUE);

ALTER TABLE rlsbug ENABLE ROW LEVEL SECURITY;  
ALTER TABLE rlsbug FORCE ROW LEVEL SECURITY;

UPDATE rlsbug SET deleted = TRUE WHERE NOT deleted;  
ERROR:  new row violates row-level security policy for table "rlsbug"


I'd say that this error is wrong.  The FOR SELECT policy should be applied  
to the WHERE condition, but certainly not to check new rows.

Yours,
Laurenz Albe




Re: Replace references to malloc() in libpq documentation with generic language

2023-10-24 Thread Daniel Gustafsson
> On 24 Oct 2023, at 07:13, Gurjeet Singh  wrote:

> The user does not benefit from knowing that libpq allocates some/all memory
> using malloc(). Mentioning malloc() here has a few downsides, and almost no
> benefits.

I'm not entirely convinced that replacing "malloc" with "allocated on the heap"
improves the documentation.  I do agree with this proposed change though:

-  all the space that will be freed by .
+  all the memory that will be freed by .

--
Daniel Gustafsson





Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-10-24 Thread Kyotaro Horiguchi
At Tue, 24 Oct 2023 07:37:22 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> Dear Horiguchi-san,
> 
> Thanks for the updates! I am not very familiar with the Windows env, but I
> can post comments.
> (We need reviews by Windows-friendly developers...)

Indeed, I haven't managed to successfully build using Meson on
Windows...

> The formatting of messages for write_stderr() seems different from others.
> In v3, I slightly modified them for readability, like below. I wanted to let
> you know just in case, because you did not say anything about these
> changes...

Ah. Sorry, I was lazy about the messages because I didn't regard this
to be at that stage yet.

In the attached, fixed the existing two messages, and adjusted one
message to display an error code, all in the consistent format.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 65013b5ccebd101f91be0e46b5209dc9cfcc7d5f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 24 Oct 2023 11:43:57 +0900
Subject: [PATCH v5 1/3] Disable autoruns for cmd.exe on Windows

On Windows, we use cmd.exe to launch the postmaster process for easy
redirection setup. However, cmd.exe may execute other programs at
startup due to autorun configurations. This patch adds /D flag to the
launcher cmd.exe command line to disable autorun settings written in
the registry.
---
 src/bin/pg_ctl/pg_ctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a9..3ac2fcc004 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -553,11 +553,11 @@ start_postmaster(void)
 		else
 			close(fd);
 
-		cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"",
+		cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"",
 	   comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file);
 	}
 	else
-		cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"",
+		cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"",
 	   comspec, exec_path, pgdata_opt, post_opts, DEVNULL);
 
 	if (!CreateRestrictedProcess(cmd, , false))
-- 
2.39.3

>From 18c9a319b5160721d08cc1d3a3df770a63756f65 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 24 Oct 2023 14:46:50 +0900
Subject: [PATCH v5 2/3] Improve pg_ctl postmaster process check on Windows

Currently pg_ctl on Windows does not verify that it actually executed
a postmaster process due to lack of process ID knowledge. This can
lead to false positives in cases where another pg_ctl instance starts
a different server simultaneously.
This patch adds the capability to identify the process ID of the
launched postmaster on Windows, similar to other OS versions, ensuring
more reliable detection of concurrent server startups.
---
 src/bin/pg_ctl/pg_ctl.c | 102 
 1 file changed, 93 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3ac2fcc004..221049db0e 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -132,6 +132,7 @@ static void adjust_data_dir(void);
 
 #ifdef WIN32
 #include 
+#include 
 static bool pgwin32_IsInstalled(SC_HANDLE);
 static char *pgwin32_CommandLine(bool);
 static void pgwin32_doRegister(void);
@@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *);
 static void pgwin32_doRunAsService(void);
 static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service);
 static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken);
+static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid);
 #endif
 
 static pid_t get_pgpid(bool is_status_request);
@@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			/* File is complete enough for us, parse it */
 			pid_t		pmpid;
 			time_t		pmstart;
-
+#ifndef WIN32
+			pid_t		wait_pid = pm_pid;
+#else
+			pid_t		wait_pid = pgwin32_find_postmaster_pid(pm_pid);
+#endif
 			/*
 			 * Make sanity checks.  If it's for the wrong PID, or the recorded
 			 * start time is before pg_ctl started, then either we are looking
@@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			 */
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
-			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-pmpid == pm_pid
-#else
-			/* Windows can only reject standalone-backend PIDs */
-pmpid > 0
-#endif
-)
+
+			if (pmstart >= start_time - 2 && pmpid == wait_pid)
 			{
 /*
  * OK, seems to be a valid pidfile from our child.  Check the
@@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken)
 
 	return tokenPrivs;
 }
+
+/*
+ * Find the PID of the launched postmaster.
+ *
+ * On Windows, the cmd.exe doesn't support the exec command. As a result, we
+ * don't directly get the postmaster's PID. This function identifies the PID of
+ * the postmaster started by the child cmd.exe.
+ *
+ * Returns the postmaster's 

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2023-10-24 Thread Andrei Zubkov
Hi,

During the last move to the current commitfest, this patch lost its
reviewers list. With respect to the reviewers' contribution to this patch, I
think the reviewers list should be restored.

regards,

Andrei Zubkov
Postgres Professional
The Russian Postgres Company





Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-24 Thread Bharath Rupireddy
On Tue, Oct 24, 2023 at 11:32 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > If we don't want to keep it generic then we should use something like
> > 'contains_decodable_change'. 'is_change_decodable' could have suited
> > here if we were checking a particular change.
>
> I kept the name for now. How does Bharath think?

No more bikeshedding from my side. +1 for processing_required as-is.

> > > 6. An optimization in count_old_cluster_logical_slots(void): Turn
> > > slot_count to a function static variable so that the for loop isn't
> > > required every time because the slot count is prepared in
> > > get_old_cluster_logical_slot_infos only once and won't change later
> > > on. Do you see any problem with the following? This saves a few CPU
> > > cycles when there are large number of replication slots.
> > > {
> > > static int slot_count = 0;
> > > static bool first_time = true;
> > >
> > > if (first_time)
> > > {
> > > for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
> > > slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots;
> > >
> > > first_time = false;
> > > }
> > >
> > > return slot_count;
> > > }
> > >
> >
> > This may not be a problem but this is also not a function that will be
> > used frequently. I am not sure if adding such code optimizations is
> > worth it.
>
> Not addressed.

count_old_cluster_logical_slots is being called 3 times during
pg_upgrade, and counting the slots of all the databases every time seems
redundant IMV, especially given that the slot count is computed once at the
beginning and never changes. When the number of replication slots on the
cluster is on the higher side, counting every time *may* prove costly. And the
use of static variables isn't a huge change requiring a different set of
infra or such; it's a simple pattern.

Having said above, if others don't see a merit in it, I'm okay to
withdraw my comment.

> Below commands are an example of the test.
>
> ```
> # test PG9.5 -> patched HEAD
> $ oldinstall=/home/hayato/older/pg95 make check 
> PROVE_TESTS='t/003_upgrade_logical_replication_slots.pl'

Oh, I get it. Thanks.

> Also, based on a comment [2], the upgrade function was renamed to
> 'binary_upgrade_logical_slot_has_caught_up'.

+1.

I spent some time on the v57 patch and it looks good to me - tests are
passing, no complaints from pgindent and pgperltidy. I turned the CF
entry https://commitfest.postgresql.org/45/4273/ to RfC.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-10-24 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

Thanks for the updates! I am not very familiar with the Windows env, but I can
post comments.
(We need reviews by Windows-friendly developers...)

> Other error cases will fit to "shouldn't occur under normal
> conditions" errors.

The formatting of messages for write_stderr() seems different from others. In
v3, I slightly modified them for readability, like below. I wanted to let you
know just in case, because you did not say anything about these changes...

```
+   /* create a process snapshot */
+   hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
+   if (hSnapshot == INVALID_HANDLE_VALUE)
+   {
+   write_stderr(_("%s: could not create a snapshot: error code 
%lu\n"),
+progname, (unsigned long) 
GetLastError());
+   exit(1);
+   }
+
+   /* start iterating on the snapshot */
+   ppe.dwSize = sizeof(PROCESSENTRY32);
+   if (!Process32First(hSnapshot, ))
+   {
+   write_stderr(_("%s: cound not retrieve information about the 
process: error code %lu\n"),
+progname, GetLastError());
+   exit(1);
+   }
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Synchronizing slots from primary to standby

2023-10-24 Thread Peter Smith
Here are some review comments for patch v25-0002

(additional to v25-0002 review comments [1])

==
src/backend/catalog/system_views.sql

1.
@@ -1003,7 +1003,8 @@ CREATE VIEW pg_replication_slots AS
 L.safe_wal_size,
 L.two_phase,
 L.conflicting,
-L.failover
+L.failover,
+L.synced_slot
 FROM pg_get_replication_slots() AS L
 LEFT JOIN pg_database D ON (L.datoid = D.oid);

AFAICT the patch is missing PG DOCS descriptions for these new view attributes.

==
src/backend/replication/logical/launcher.c

2. slotsync_remove_obsolete_dbs

+
+ /*
+ * TODO: Take care of of removal of old 'synced' slots for the dbs which
+ * are no longer eligible for slot-sync.
+ */

typo: "of of"

~~~

3.
+ /*
+ * Make sure that concerned WAL is received before syncing slot to target
+ * lsn received from the primary.
+ *
+ * This check should never pass as on the primary, we have waited for
+ * standby's confirmation before updating the logical slot. But to take
+ * care of any bug in that flow, we should retain this check.
+ */
+ if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd)
+ {
+ ereport(LOG,
+ errmsg_internal("skipping sync of slot \"%s\" as the received slot-sync "
+ "lsn %X/%X is ahead of the standby position %X/%X",
+ remote_slot->name,
+ LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
+ LSN_FORMAT_ARGS(WalRcv->latestWalEnd)));
+ return;
+ }

Would elog be better here than using ereport(LOG, errmsg_internal...);
IIUC it does the same thing?
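
(FWIW, in recent branches elog() is itself defined in terms of ereport() plus
errmsg_internal(); simplified from elog.h:)

```
#define elog(elevel, ...) \
	ereport(elevel, errmsg_internal(__VA_ARGS__))
```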

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPspseC03Fhsi%3DOqOtksagspE%2B0MVOhrhhUb64cc_4SE1w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-10-24 Thread Peter Smith
Here are some review comments for v24-0001

==
1. GENERAL - failover slots terminology

There is inconsistent terminology, such as below. Try to use the same
wording everywhere.
- failover logical slots
- failover slots
- logical failover slots
- logical replication failover slots
- etc.

These are in many places - comments, function names, constants etc.

~~~

2. GENERAL - THE

s/primary.../the primary.../
s/standby.../the standby.../

Missing "the" problems remain in multiple places in the patch.

~~~

3. GENERAL - messages

I searched all the ereports and elogs (the full list is below only for
reference). There are many little quirks:

3a. Sometimes messages say "primary"; sometimes "primary server" etc.
Be consistent.

3b. /primary/the primary/

3c. Sometimes messages include errcode and sometimes they do not; Are
they deliberate or are there missing errcodes?

3d. At least one message has unwanted trailing space

3e. Sometimes using errcode and/or errmsg enclosed in parentheses;
sometimes not. AFAIK it is not necessary anymore.

3f. Inconsistent terminology "slot" V "failover slots" V "failover
logical slots" etc mentioned in the previous review comment #1

3g. Sometimes messages "slot creation aborted"; Sometimes "aborting
slot creation". Be consistent.

3h. s/lsn/LSN/

3i. s/move it backward/move it backwards/

3j. Sometimes LOG message starts uppercase; Sometimes lowercase. Be consistent.

3k. typo: s/and and/and/

3l. "worker %d" V "worker%d"

~

Messages:

ereport(ERROR, (errmsg("could not receive failover slots dbinfo from
the primary server: %s", pchomp(PQerrorMessage(conn->streamConn);
ereport(ERROR, (errmsg("invalid response from primary server"),
errdetail("Could not get failover slots dbinfo: got %d fields, "
"expected 1", nfields)));
ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid
connection string syntax: %s", errcopy)));
ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("replication slot-sync worker slot %d is " "empty, cannot
attach", slot)));
ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("replication slot-sync worker slot %d is " "already used by
another worker, cannot attach", slot)));
ereport(ERROR, (errmsg("could not connect to the primary server: %s", err)));
ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot use replication slot \"%s\" for logical decoding",
NameStr(slot->data.name)), errdetail("This slot is being synced  from
the primary."), errhint("Specify another replication slot.")));
ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\"
from" " the primary: %s", remote_slot->name, res->err)));
ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\"
from" " the primary: %s", remote_slot->name, res->err)));
ereport(ERROR, (errmsg("could not fetch invalidation cause for slot
\"%s\" from" " primary: %s", slot_name, res->err)));
ereport(ERROR, (errmsg("slot \"%s\" disappeared from the primary", slot_name)));
ereport(ERROR, (errmsg("could not fetch failover logical slots info
from the primary: %s", res->err)));
ereport(ERROR, (errmsg("could not connect to the primary server: %s", err)));
ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("could not map dynamic shared memory " "segment for slot-sync
worker")));
ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot drop replication slot \"%s\"", name), errdetail("This
slot is being synced from the primary.")));
ereport(ERROR, (errmsg("could not receive failover slots dbinfo from
the primary server: %s", pchomp(PQerrorMessage(conn->streamConn);
ereport(ERROR, (errmsg("invalid response from primary server"),
errdetail("Could not get failover slots dbinfo: got %d fields, "
"expected 1", nfields)));
ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid
connection string syntax: %s", errcopy)));

ereport(WARNING, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
errmsg("out of background worker slots"), errhint("You might need to
increase %s.", "max_worker_processes")));
ereport(WARNING, (errmsg("replication slot-sync worker failed to
attach to " "worker-pool slot %d", worker_slot)));
ereport(WARNING, errmsg("skipping slots synchronization as
primary_slot_name " "is not set."));
ereport(WARNING, errmsg("skipping slots synchronization as
hot_standby_feedback " "is off."));
ereport(WARNING, errmsg("skipping slots synchronization as dbname is
not " "specified in primary_conninfo."));
ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot
creation aborted", remote_slot->name)));
ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot
creation aborted", remote_slot->name)));
ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary,
aborting" " slot creation", remote_slot->name)));

Re: RFC: Logging plan of the running query

2023-10-24 Thread Étienne BERSAC
> Hi,
> 
> Yeah, and when we have a situation where we want to run
> pg_log_query_plan(), we can run it in any environment as long as it
> is bundled with the core.

Is it possible to get the plan of a backend programmatically without
access to the logs?

Something like pg_get_query_plan(pid, format) which output would be the
same as EXPLAIN.

Regards,
Étienne BERSAC
Dalibo




Re: Fix output of zero privileges in psql

2023-10-24 Thread Laurenz Albe
On Mon, 2023-10-23 at 22:43 -0400, Tom Lane wrote:
> Laurenz Albe  writes:
> > On Mon, 2023-10-23 at 11:37 -0700, David G. Johnston wrote:
> > > I do believe that we should be against exposing, like in this case, any 
> > > internal
> > > implementation detail that encodes something (e.g., default privileges) 
> > > as NULL
> > > in the catalogs, to the user of the psql meta-commands.
> 
> > Sure, it would be best to hide this implementation detail from the user.
> > The correct way to do that would be to fake an ACL entry like 
> > "laurenz=arwdDxt/laurenz"
> > if there is a NULL in the catalog, but that would add a ton of special-case
> > code to psql, which does not look appealing at all.
> 
> For better or worse, that *is* the backend's catalog representation,
> and I don't think that psql would be doing our users a service by
> trying to obscure the fact.  They'd run into it anyway the moment
> they look at the catalogs with anything but a \d-something command.

... for example with a client like pgAdmin, which is a frequent choice
of many PostgreSQL beginners (they display empty privileges).

Yes, it is "(default)" or NULL.  The former is friendlier for beginners,
the latter incurs less backward incompatibility.

I could live with either solution, but I am still leaning towards NULL.

I ran the regression tests with a patch that displays "(default)",
and I counted 22 failures, excluding the one added by my patch.
The tests can of course be fixed, but perhaps that serves as a measure
of the backward incompatibility.

Yours,
Laurenz Albe




Re: Synchronizing slots from primary to standby

2023-10-24 Thread Drouvot, Bertrand

Hi,

On 10/23/23 2:56 PM, shveta malik wrote:

On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand
 wrote:



We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if 
there
is new synced slot(s) to be created on the standby. Do we want to keep this 
behavior
for V1?



I think for the slotsync workers case, we should reduce the naptime in
the launcher to say 30sec and retain the default one of 3mins for
subscription apply workers. Thoughts?



Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new
API on the standby that would refresh the list of sync slots on demand, thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add trailing commas to enum definitions

2023-10-24 Thread Junwang Zhao
On Tue, Oct 24, 2023 at 4:34 AM Nathan Bossart  wrote:
>
> On Mon, Oct 23, 2023 at 05:55:32PM +0800, Junwang Zhao wrote:
> > On Mon, Oct 23, 2023 at 2:37 PM Peter Eisentraut  
> > wrote:
> >> Since C99, there can be a trailing comma after the last value in an enum
> >
> > C99 allows us to do this doesn't mean we must do this, this is not
> > inconsistent IMHO, and this will pollute the git log messages, people
> > may *git blame* the file and see the reason for the introduction of the
> > line.
>
> I suspect that your concerns about git-blame could be resolved by adding
> this commit to .git-blame-ignore-revs.  From a long-term perspective, I
> think standardizing on the trailing comma style will actually improve
> git-blame because patches won't need to add a comma to the previous line
> when adding a value.

Makes sense, +1 from me now.

>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com



-- 
Regards
Junwang Zhao




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-24 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Amit,

Thanks for reviewing! PSA new version.
I addressed comments which have not been claimed.

> On Mon, Oct 23, 2023 at 2:00 PM Bharath Rupireddy
>  wrote:
> >
> > On Mon, Oct 23, 2023 at 11:10 AM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > Thank you for reviewing! PSA new version.
> >
> > > > 6. A nit: how about is_decodable_txn or is_decodable_change or some
> > > > other instead of just a plain name processing_required?
> > > > +/* Do we need to process any change in 'fast_forward' mode? */
> > > > +boolprocessing_required;
> > >
> > > I preferred current one. Because not only decodable txn, non-txn change 
> > > and
> > > empty transactions also be processed.
> >
> > Right. It's not the txn, but the change. processing_required seems too
> > generic IMV. A nit: is_change_decodable or something?
> >
> 
> If we don't want to keep it generic then we should use something like
> 'contains_decodable_change'. 'is_change_decodable' could have suited
> here if we were checking a particular change.

I kept the name for now. What does Bharath think?

> > Thanks for the patch. Here are few comments on v56 patch:
> >
> > 1.
> > + *
> > + * Although this function is currently used only during pg_upgrade, there 
> > are
> > + * no reasons to restrict it, so IsBinaryUpgrade is not checked here.
> >
> > This comment isn't required IMV, because anyone looking at the code
> > and callsites can understand it.

Removed.

> > 2. A nit: IMV "This is a special purpose ..." statement seems redundant.
> > + *
> > + * This is a special purpose function to ensure that the given slot can be
> > + * upgraded without data loss.
> >
> > How about
> >
> > Verify that the given replication slot has consumed all the WAL changes.
> > If there's any decodable WAL record after the slot's
> > confirmed_flush_lsn, the slot's consumer will lose that data after the
> > slot is upgraded.
> > Returns true if there are no decodable WAL records after the
> > confirmed_flush_lsn. Otherwise false.
> >
> 
> Personally, I find the current comment succinct and clear.

I kept current one.

> > 3.
> > +if (PG_ARGISNULL(0))
> > +elog(ERROR, "null argument to
> > binary_upgrade_validate_wal_records is not allowed");
> >
> > I can see the above style is referenced from
> > binary_upgrade_create_empty_extension, but IMV the following looks
> > better and latest (ereport is new style than elog)
> >
> > ereport(ERROR,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >  errmsg("replication slot name cannot be null")));
> >
> 
> Do you have any theory for making elog to ereport? I am not completely
> sure but as this and related function is used internally, so using
> elog seems reasonable. Also, I find keeping it consistent with the
> existing error message is also reasonable. We can change both later
> together if we get a broader agreement.

I kept the current style. elog() was used here because I regarded it as a
"cannot happen" error. According to the docs [1], elog() is still used
for that purpose.

> > 4. The following comment seems frivolous, the code tells it all.
> > Please remove the comment.
> > +
> > +/* No need to check this slot, seek to new one */
> > +continue;

Removed.

> > 5. A typo - s/gets/Gets
> > + * gets the LogicalSlotInfos for all the logical replication slots of the

Replaced.

> > 6. An optimization in count_old_cluster_logical_slots(void): Turn
> > slot_count to a function static variable so that the for loop isn't
> > required every time because the slot count is prepared in
> > get_old_cluster_logical_slot_infos only once and won't change later
> > on. Do you see any problem with the following? This saves a few CPU
> > cycles when there are large number of replication slots.
> > {
> > static int slot_count = 0;
> > static bool first_time = true;
> >
> > if (first_time)
> > {
> > for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
> > slot_count += old_cluster.dbarr.dbs[dbnum].slot_arr.nslots;
> >
> > first_time = false;
> > }
> >
> > return slot_count;
> > }
> >
> 
> This may not be a problem but this is also not a function that will be
> used frequently. I am not sure if adding such code optimizations is
> worth it.

Not addressed.

> > 7. A typo: s/slotname/slot name. "slot name" looks better in user
> > visible messages.
> > +pg_log(PG_VERBOSE, "slotname: \"%s\", plugin: \"%s\",
> two_phase: %s",
> >
> 
> If we want to follow other parameters then we can even use slot_name.

Changed to slot_name.

Below are replies to the remaining comments:

>8.
>+else
>+{
>+test_upgrade_from_pre_PG17($old_publisher, $new_publisher,
>+@pg_upgrade_cmd);
>+}
>Will this ever be tested in current TAP test framework? I mean, will
>the TAP test framework allow testing upgrades from one PG version to
>another PG version?

Yes, the TAP tester allow to do 

Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-10-24 Thread Kyotaro Horiguchi
At Mon, 23 Oct 2023 08:57:19 +, "Hayato Kuroda (Fujitsu)" 
 wrote in 
> > I agree with your suggestion.  Ultimately, if there's a possibility
> > for this to be committed, the message will be consolidated to "could
> > not start server".
> 
> Based on the suggestion, I tried to update the patch.
> A new argument is_valid is added for reporting callee. Also, reporting formats
> are adjusted based on other functions. How do you think?

An equivalent check is already done shortly afterward in the calling
function. Therefore, we can simply remove the code path for "launcher
shell died", and it will work the same way. Please find the attached.

Other error cases will fit to "shouldn't occur under normal
conditions" errors.

There is a possibility that the launcher shell terminates while the
postmaster is running. Even in such a case, the server continues
working without any problems. I contemplated accommodating this case,
but the effort required seemed disproportionate to the possibility.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 530dc21be72e4e7f5ea199a9fceee00bbb88cf75 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 24 Oct 2023 11:43:57 +0900
Subject: [PATCH v4 1/3] Disable autoruns for cmd.exe on Windows

On Windows, we use cmd.exe to launch the postmaster process for easy
redirection setup. However, cmd.exe may execute other programs at
startup due to autorun configurations. This patch adds /D flag to the
launcher cmd.exe command line to disable autorun settings written in
the registry.
---
 src/bin/pg_ctl/pg_ctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 807d7023a9..3ac2fcc004 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -553,11 +553,11 @@ start_postmaster(void)
 		else
 			close(fd);
 
-		cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"",
+		cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" >> \"%s\" 2>&1\"",
 	   comspec, exec_path, pgdata_opt, post_opts, DEVNULL, log_file);
 	}
 	else
-		cmd = psprintf("\"%s\" /C \"\"%s\" %s%s < \"%s\" 2>&1\"",
+		cmd = psprintf("\"%s\" /D /C \"\"%s\" %s%s < \"%s\" 2>&1\"",
 	   comspec, exec_path, pgdata_opt, post_opts, DEVNULL);
 
 	if (!CreateRestrictedProcess(cmd, , false))
-- 
2.39.3

>From 913a0fbe0937331fca6ea4099834ed3001714339 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 24 Oct 2023 14:46:50 +0900
Subject: [PATCH v4 2/3] Improve pg_ctl postmaster process check on Windows

Currently pg_ctl on Windows does not verify that it actually executed
a postmaster process due to lack of process ID knowledge. This can
lead to false positives in cases where another pg_ctl instance starts
a different server simultaneously.
This patch adds the capability to identify the process ID of the
launched postmaster on Windows, similar to other OS versions, ensuring
more reliable detection of concurrent server startups.
---
 src/bin/pg_ctl/pg_ctl.c | 102 
 1 file changed, 93 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 3ac2fcc004..9c0168b075 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -132,6 +132,7 @@ static void adjust_data_dir(void);
 
 #ifdef WIN32
 #include 
+#include 
 static bool pgwin32_IsInstalled(SC_HANDLE);
 static char *pgwin32_CommandLine(bool);
 static void pgwin32_doRegister(void);
@@ -142,6 +143,7 @@ static void WINAPI pgwin32_ServiceMain(DWORD, LPTSTR *);
 static void pgwin32_doRunAsService(void);
 static int	CreateRestrictedProcess(char *cmd, PROCESS_INFORMATION *processInfo, bool as_service);
 static PTOKEN_PRIVILEGES GetPrivilegesToDelete(HANDLE hToken);
+static pid_t pgwin32_find_postmaster_pid(pid_t shell_pid);
 #endif
 
 static pid_t get_pgpid(bool is_status_request);
@@ -609,7 +611,11 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			/* File is complete enough for us, parse it */
 			pid_t		pmpid;
 			time_t		pmstart;
-
+#ifndef WIN32
+			pid_t		wait_pid = pm_pid;
+#else
+			pid_t		wait_pid = pgwin32_find_postmaster_pid(pm_pid);
+#endif
 			/*
 			 * Make sanity checks.  If it's for the wrong PID, or the recorded
 			 * start time is before pg_ctl started, then either we are looking
@@ -619,14 +625,8 @@ wait_for_postmaster_start(pid_t pm_pid, bool do_checkpoint)
 			 */
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
-			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-pmpid == pm_pid
-#else
-			/* Windows can only reject standalone-backend PIDs */
-pmpid > 0
-#endif
-)
+
+			if (pmstart >= start_time - 2 && pmpid == wait_pid)
 			{
 /*
  * OK, seems to be a valid pidfile from our child.  Check the
@@ -1950,6 +1950,90 @@ GetPrivilegesToDelete(HANDLE hToken)
 
 	return tokenPrivs;
 }
+
+/*
+ * Find the PID of the launched postmaster.
+ *
+ * On Windows,