Re: Meson far from ready on Windows

2024-06-25 Thread Andres Freund
Hi,

On 2024-06-24 09:44:57 +0100, Dave Page wrote:
> To address Andres' concerns around mis-detection of dependencies, or other
> oddities such as required compiler flags not being included, I would
> suggest that a) that's happened very rarely, if ever, in the past, and b)
> we can always spit out an obvious warning if we've not been able to use
> cmake or pkgconfig for any particular dependencies.

I personally spent quite a few days hunting down issues related to this. Not
because I wanted to, but because it was causing breakage and nobody else was
looking.  For several years postgres didn't build against a modern perl, for
example, see the stuff leading up to
  http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ccc59a83cd97
but nobody seemed to care for a prolonged amount of time.

We have evidence of random build hackery all over the tree - often
entirely outdated, sometimes even *breaking* builds these days ([1]):

https://github.com/postgres/postgres/blob/master/src/interfaces/libpq/Makefile#L80-L88
(wrong library names for kerberos on 64bit systems, wrong ssl libnames for )

https://github.com/postgres/postgres/blob/master/contrib/bool_plperl/Makefile#L30
https://github.com/postgres/postgres/blob/master/src/pl/plpython/Makefile#L59-L72
https://github.com/postgres/postgres/blob/master/src/pl/tcl/Makefile#L35-L51
https://github.com/postgres/postgres/blob/master/config/python.m4#L62-L64x

There's plenty more; some of the more complicated cases are a bit less trivial
to search for.

Greetings,

Andres Freund

[1] 
https://postgr.es/m/CAGPVpCSKS9E0An4%3De7ZDnme%2By%3DWOcQFJYJegKO8kE9%3Dgh8NJKQ%40mail.gmail.com




Re: Meson far from ready on Windows

2024-06-25 Thread Andres Freund
Hi,

On 2024-06-21 12:20:49 +0100, Dave Page wrote:
> > I'm confused - the old build system wasn't flexible around this stuff *at
> > all*. Everyone had to patch it to get dependencies to work, unless you
> > chose
> > exactly the right source to download from - which was often not documented
> > or
> > outdated.
> >
> 
> As I noted above - as the "owner" of the official packages, I never did
> despite using a variety of upstream sources.

For reference, with 16 and src/tools/msvc:
- upstream zstd build doesn't work, wrong filename (libzstd.dll.a instead of 
libzstd.lib)
- upstream lz4 build doesn't work, wrong filename (liblz4.dll.a instead of 
liblz4.lib)
- openssl, from https://slproweb.com/products/Win32OpenSSL.htm , as our
  docs suggest: doesn't work, wrong filenames (openssl.lib instead of
  lib*64.lib, works if you delete lib/VC/sslcrypto64MD.lib)
- iconv/intl: mismatching library names (lib*.dll.a lib*.lib)

- zlib, at least from some of the sources (it's hard to tell, because
  everything available is so outdated), wrong filenames

Upstream ICU works.

I gave up at this point, so I don't know if libxml, xslt and uuid work without
patching the sources.

Greetings,

Andres Freund




Re: Meson far from ready on Windows

2024-06-24 Thread Andres Freund
Hi,

On 2024-06-24 09:54:51 +0100, Dave Page wrote:
> > The old system was a major bottleneck. For one, there was no way to run all
> > tests. And even the tests that one could run, would run serially, leading
> > to
> > exceedingly long tests times. While that could partially be addressed by
> > having both buildsystems in parallel, the old one would frequently break
> > in a
> > way that one couldn't reproduce on other systems. And resource wise it
> > wasn't
> > feasible to test both old and new system for cfbot/CI.
> >
> 
> Hmm, I've found that running the tests under Meson takes notably longer
> than the old system - maybe 5 - 10x longer ("meson test" vs. "vcregress
> check"). I haven't yet put any effort into figuring out a cause for that
> yet.

That's because vcregress check only runs a small portion of the tests (just
the main pg_regress checks, no tap tests, no extension tests). Which is pretty much
my point.

To run a decent, but still far from complete, portion of the tests you needed
to do this:
https://github.com/postgres/postgres/blob/REL_15_STABLE/.cirrus.tasks.yml#L402-L440

If you want to run just the main regression tests with meson, you can:
  meson test --suite setup --suite regress

To see the list of all tests
  meson test --list

Greetings,

Andres Freund




Re: Meson far from ready on Windows

2024-06-22 Thread Andres Freund
Hi, 

On June 22, 2024 7:32:01 PM GMT+02:00, walt...@technowledgy.de wrote:
>Andres Freund:
>> FWIW, dynamic linking has a noticeable overhead on other platforms too. A
>> non-dependencies-enabled postgres can do about 2x the connections-per-second
>> than a fully kitted out postgres can (basically due to more memory mapping
>> metadata being copied).  But on windows the overhead is larger because so 
>> much
>> more happens for every new connections, including loading all dlls from
>> scratch.
>> 
>> I suspect linking a few libraries statically would be quite worth it on
>> windows. On other platforms it'd be quite inadvisable to statically link
>> libraries, due to security updates, [...]
>That's not necessarily true. The nix package manager and thus NixOS track all 
>dependencies for a piece of software. If any of the dependencies are updated, 
>all dependents are rebuilt, too. So the security concern doesn't apply here. 
>There is a "static overlay", which builds everything linked fully statically. 

Right. There's definitely some scenario where it's ok, I was simplifying a bit.

> Unfortunately, PostgreSQL doesn't build in that, so far.

I've built a mostly statically linked pg without much of a problem; what trouble
did you encounter? I think there were some issues with linking Kerberos and
openldap statically, but not on postgres' side.

Building the postgres backend without support for dynamic linking doesn't make
sense though. Extensions are just too ingrained a part of pg.


>Lately, I have been looking into building at least libpq in that static 
>overlay, via Meson. There are two related config options:
>-Ddefault_library=shared|static|both
>-Dprefer_static
>
>The first controls which libraries (libpq, ...) to build ourselves. The second 
>controls linking, IIUC also against external dependencies.

Pg by default builds a static libpq on nearly all platforms (not aix I think 
and maybe not Windows when building with autoconf, not sure about the old msvc 
system) today?


>Maybe it would be a first step to support -Dprefer_static?

That should work for nearly all dependencies today. Except for libintl, I 
think.  I found that there are a lot of buglets in static link dependencies of 
various libraries though. 


Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Meson far from ready on Windows

2024-06-22 Thread Andres Freund
Hi,

On 2024-06-21 15:36:56 +0100, Dave Page wrote:
> For giggles, I took a crack at doing that, manually creating .pc files for
> everything I've been working with so far.

Cool!


> It seems to work as expected, except that unlike everything else libintl is
> detected entirely based on whether the header and library can be found. I
> had to pass extra lib and include dirs:

Yea, right now libintl isn't using dependency detection because I didn't see
any platform where it's distributed with a .pc file or such. It'd be just a
line or two to make it use one...

Greetings,

Andres Freund




Re: Meson far from ready on Windows

2024-06-22 Thread Andres Freund
Hi,

On 2024-06-21 12:20:49 +0100, Dave Page wrote:
> On Thu, 20 Jun 2024 at 21:58, Andres Freund  wrote:
> That is precisely what https://github.com/dpage/winpgbuild/ was intended
> for - and it works well for PG <= 16.

If we develop it into that - I'd be happy. I mostly want to be able to do
automated tests on windows with all reasonable dependencies. And occasionally
do some interactive investigation, without a lot of setup time.

One small advantage of something outside of PG is that it's easy to add
additional dependencies when developing additional features. Except of course
all the windows packaging systems seem ... suboptimal.


> I don't think it's unreasonable to not support static linking, but I take
> your point.

Separately from this thread: ISTM that on windows it'd be quite beneficial to
link a few things statically, given how annoying dealing with dlls can be?
There's also the perf issue addressed further down.


> My assumption all along was that Meson would replace autoconf etc. before
> anything happened with MSVC, precisely because that's the type of
> environment all the Postgres devs work in primarily. Instead we seem to
> have taken what I think is a flawed approach of entirely replacing the
> build system on the platform none of the devs use, whilst leaving the new
> system as an experimental option on the platforms it will have had orders
> of magnitude more testing.

The old system was a major bottleneck. For one, there was no way to run all
tests. And even the tests that one could run, would run serially, leading to
exceedingly long tests times. While that could partially be addressed by
having both buildsystems in parallel, the old one would frequently break in a
way that one couldn't reproduce on other systems. And resource wise it wasn't
feasible to test both old and new system for cfbot/CI.


> What makes it worse, is that I don't believe anyone was warned about such a
> drastic change. Packagers were told about the git archive changes to the
> tarball generation, but that's it (and we've said before, we can't expect
> packagers to follow all the activity on -hackers).

I'm sure we could have dealt better with it. There certainly was some lack of
cohesion because I wasn't able to drive the src/tools/msvc removal and
Michael took up the slack.

But I also don't think it's really fair to say that there was no heads
up. Several people at EDB participated in the removal and buildfarm
maintainers at EDB were repeatedly pinged, to move their buildfarm animals
over.

And of course the meson stuff came out a year earlier and it wouldn't have
been exactly unreasonable


> Well as I noted, that is the point of my Github repo above. You can just go
> download the binaries from the all_deps action - you can even download the
> config.pl and buildenv.pl that will work with those dependencies (those
> files are artefacts of the postgresql action).

For the purpose of CI we'd really need debug builds of most of the libraries -
there are compat issues when mixing debug/non-debug runtimes (we've hit them
occasionally) and not using the debug runtime hides a lot of issues. Of course
also not optimal for CI / BF usage.



> > - Linking the backend dynamically against lz4, icu, ssl, xml, xslt, zstd,
> > zlib
> >   slows down the tests noticeably (~20%).  So I ended up building those
> >   statically.

> Curious. I wonder if that translates into a general 20% performance hit.
> Presumably it would for anything that looks similar to whatever test/tests
> are affected.

FWIW, dynamic linking has a noticeable overhead on other platforms too. A
non-dependencies-enabled postgres can do about 2x the connections-per-second
than a fully kitted out postgres can (basically due to more memory mapping
metadata being copied).  But on windows the overhead is larger because so much
more happens for every new connection, including loading all dlls from
scratch.

I suspect linking a few libraries statically would be quite worth it on
windows. On other platforms it'd be quite inadvisable to statically link
libraries, due to security updates, but for stuff like the EDB windows
installer dynamic linking doesn't really help with that afaict?



> >   I ran into some issue with using a static libintl. I made it work, but
> > for
> >   now reverted to a dynamic one.
> >
> >
> > - Enabling nls slows down the tests by about 15%, somewhat painful. This is
> >   when statically linking, it's a bit worse when linked dynamically :(.
> >
> 
> That one I can imagine being in psql, so maybe not a big issue for most
> real world use cases.

I think it's both psql and backend.  I think partially it's just due the
additional libraries bein

Re: Meson far from ready on Windows

2024-06-20 Thread Andres Freund
work, but for
  now reverted to a dynamic one.


- Enabling nls slows down the tests by about 15%, somewhat painful. This is
  when statically linking, it's a bit worse when linked dynamically :(.


- readline: Instead of the old issue with a compiler error, now we get a
  compiler crash:
  
https://developercommunity.visualstudio.com/t/tab-completec4023:-fatal-error-C1001:/10685868

  The issue is fairly trivial to work around, we just need to break the
  if/else chain into two. Probably deserves a bigger refactoring, but that's
  for another day.
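
  For illustration, splitting a long cascade could look roughly like this
  (made-up completions, not the actual tab-complete.c code):

#include <stdio.h>
#include <string.h>

static void
complete_sketch(const char *prev)
{
	/* first chain */
	if (strcmp(prev, "ALTER") == 0)
		puts("TABLE");
	else if (strcmp(prev, "CREATE") == 0)
		puts("INDEX");

	/*
	 * Starting a fresh "if" here splits what used to be one huge if/else
	 * cascade in two; the conditions don't overlap, so behavior is the same,
	 * but the compiler never has to digest the whole chain at once.
	 */
	if (strcmp(prev, "DROP") == 0)
		puts("TABLE");
	else if (strcmp(prev, "VACUUM") == 0)
		puts("FULL");
}

int
main(void)
{
	complete_sketch("CREATE");
	complete_sketch("DROP");
	return 0;
}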


I haven't yet looked into a) uuid b) tcl.  I think those are the only other
missing dependencies.


> > Many of them do include at least cmake files on windows if you build them
> > though?

> The only one that does is libxml2 as far as I can see. And that doesn't
> seem to work even if I use --cmake-prefix-path= as you suggested:

Ugh, that's because they used a different name for their cmake dependency than
for pkg-config. We can add the alternative spelling to meson.build.



>
> > > And that's why we really need to be able to locate headers and libraries
> > > easily by passing paths to meson, as we can't rely on pkgconfig, cmake,
> > or
> > > things being in some standard directory on Windows.
> >
> > Except that that often causes hard to diagnose breakages, because that
> > doesn't allow including the necessary compiler/linker flags [2].  It's a
> > bad model, we shouldn't perpetuate it.  If we want to forever make windows
> > a complicated annoying stepchild, that's the way to go.
>
> That is a good point, though I suspect it wouldn't solve your second
> example of the Kerberos libraries, as you'll get both 32 and 64 bit libs if
> you follow their standard process for building on Windows so you still need
> to have code to pick the right ones.

vcpkg for one does provide .pc files for kerberos.


> > FWIW, at least libzstd, libxml [3], lz4, zlib can generate cmake dependency
> > files on windows in their upstream code.
> >
>
> In the case of zstd, it does not if you build with VC++, the Makefile, or
> Meson, at least in my testing.

When building with meson it does generate a .pc file, which does work with
PG as-is.


> It looks like it would if you built it with cmake, but I couldn't get that
> to work in 10 minutes or so of messing around.


> And that's a perfect example of what I'm bleating about - there are often
> many ways of building things on Windows and there are definitely many ways
> of getting things on Windows, and they're not all equal.

Right - but that's precisely why it's unreasonable for postgres to know
all the ins and outs of the different file locations and compiler flags for
all those sources. Hence the need to abstract that.


> We've either got to be extremely prescriptive in our docs, telling people
> precisely what they need to download for each dependency, or how to build it
> themselves in the way that will work with PostgreSQL, or the build system
> needs to be flexible enough to handle different dependency variations, as
> the old VC++ build system was.

I'm confused - the old build system wasn't flexible around this stuff *at
all*. Everyone had to patch it to get dependencies to work, unless you chose
exactly the right source to download from - which was often not documented or
outdated.

For example:
- 
https://github.com/microsoft/vcpkg/blob/master/ports/libpq/windows/msbuild.patch
- 
https://github.com/conan-io/conan-center-index/blob/1b24f7c74994ec6573e322b7ae4111c10f620ffa/recipes/libpq/all/conanfile.py#L116-L160
- https://github.com/conda-forge/postgresql-feedstock/tree/main/recipe/patches

Greetings,

Andres Freund




Re: IPC::Run accepts bug reports

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 12:00:13 -0700, Andres Freund wrote:
> On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> > > 1) Sometimes hangs hard on windows if started processes have not been shut
> > >down before script exits.  I've mostly encountered this via the 
> > > buildfarm /
> > >CI, so I never had a good way of narrowing this down. It's very painful
> > >because things seem to often just get stuck once that happens.
> >
> > That's bad.  Do you have a link to a log, a thread discussing it, or even 
> > just
> > one of the test names experiencing it?
>
> I'm unfortunately blanking on the right keyword right now.
>
> I think it basically required not shutting down a process started in the
> background with IPC::Run.
>
> I'll try to repro it by removing some ->finish or ->quit calls.

Yep, that did it.  It reliably reproduces if I comment out
the lines below
  # explicitly shut down psql instances gracefully - to avoid hangs
  # or worse on windows
in 021_row_visibility.pl

The logfile ends in
Warning: unable to close filehandle GEN25 properly: Bad file descriptor during 
global destruction.
Warning: unable to close filehandle GEN20 properly: Bad file descriptor during 
global destruction.


Even if I cancel the test, I can't rerun it, because of the leftover psql:
a) a new temp install can't be made (could be solved by rm -rf)
b) the test's logfile can't be removed (couldn't even rename the directory)

The psql instance needs to be found and terminated first.


Greetings,

Andres Freund




Re: allow changing autovacuum_max_workers without restarting

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 16:09:09 -0500, Nathan Bossart wrote:
> On Tue, Jun 18, 2024 at 01:43:34PM -0700, Andres Freund wrote:
> > I just don't see much point in reserving 256 worker "possibilities", tbh. I
> > can't think of any practical system where it makes sense to use this much 
> > (nor
> > do I think it's going to be reasonable in the next 10 years) and it's just
> > going to waste memory and startup time for everyone.
> 
> Given this, here are some options I see for moving this forward:
> 
> * lower the cap to, say, 64 or 32
> * exclude autovacuum worker slots from computing number of locks, etc.

That seems good regardless

> * make the cap configurable and default it to something low (e.g., 8)


Another one:

Have a general cap of 64, but additionally limit it to something like
 max(1, min(WORKER_CAP, max_connections / 4))

so that cases like tap tests don't end up allocating vastly more worker slots
than actual connection slots.
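
As a rough sketch (the cap constant and helper are invented here, not taken
from any actual patch), that clamping could look like:

/* Sketch only: AV_WORKER_CAP and this helper don't exist anywhere. */
#define AV_WORKER_CAP 64

static int
autovac_reserved_worker_slots(int max_connections)
{
	int		slots = AV_WORKER_CAP;

	if (slots > max_connections / 4)
		slots = max_connections / 4;
	if (slots < 1)
		slots = 1;

	return slots;
}

With max_connections = 10, the value Cluster.pm uses for tap tests, that
works out to 2 reserved slots rather than the full cap.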


> My intent with a reserved set of 256 slots was to prevent users from
> needing to deal with two GUCs.  For all practical purposes, it would be
> possible to change autovacuum_max_workers whenever you want.  But if the
> extra resource requirements are too much of a tax, I'm content to change
> course.

Approximately tripling shared memory usage for tap test instances does seem
too much to me.

Greetings,

Andres Freund




Re: CI and test improvements

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 08:36:57 -0500, Justin Pryzby wrote:
> On Fri, Jun 14, 2024 at 08:34:37AM -0700, Andres Freund wrote:
> > Hm. There actually already is the mingw ccache installed. The images for 
> > mingw and msvc used to be separate (for startup performance when using 
> > containers), but we just merged them.  So it might be easiest to just 
> > explicitly use the ccache from there
> 
> > (also an explicit path might be faster).
> 
> I don't think the path search is significant; it's fast so long as
> there's no choco stub involved.

Comparatively it's definitely small, but I've seen it make a difference on
windows.


> > Could you check if that's fast?
> 
> Yes, it is.

Cool.


> > If not, we can just install the binaries distributed by the project, it's 
> > just more work to keep up2date that way. 
> 
> I guess you mean a separate line to copy choco's ccache.exe somewhere.

I was thinking of alternatively just using the binaries upstream
distributes. But the mingw way seems easier.

Perhaps we should add an environment variable pointing to ccache to the image,
or such?


>build_script: |
>  vcvarsall x64
> -ninja -C build
> +ninja -C build |tee build.txt
> +REM Since pipes lose the exit status of the preceding command, rerun the 
> compilation
> +REM without the pipe, exiting now if it fails, to avoid trying to run 
> checks
> +ninja -C build > nul

Perhaps we could do something like
(ninja -C build && touch success) | tee build.txt
cat success
?


> +CCACHE_MAXSIZE: "500M"

Does it have to be this big to work?


>configure_script: |
>  vcvarsall x64
> -meson setup --backend ninja --buildtype debug 
> -Dc_link_args=/DEBUG:FASTLINK -Dcassert=true -Dinjection_points=true 
> -Db_pch=true -Dextra_lib_dirs=c:\openssl\1.1\lib 
> -Dextra_include_dirs=c:\openssl\1.1\include -DTAR=%TAR% 
> -DPG_TEST_EXTRA="%PG_TEST_EXTRA%" build
> +meson setup build --backend ninja --buildtype debug 
> -Dc_link_args=/DEBUG:FASTLINK -Dcassert=true -Dinjection_points=true 
> -Db_pch=true -Dextra_lib_dirs=c:\openssl\1.1\lib 
> -Dextra_include_dirs=c:\openssl\1.1\include -DTAR=%TAR% 
> -DPG_TEST_EXTRA="%PG_TEST_EXTRA%" -Dc_args="/Z7"

A comment explaining why /Z7 is necessary would be good.



> From 3a399c6350ed2f341425431f184e382c3f46d981 Mon Sep 17 00:00:00 2001
> From: Justin Pryzby 
> Date: Sat, 26 Feb 2022 19:39:10 -0600
> Subject: [PATCH 4/7] WIP: cirrus: upload changed html docs as artifacts
> 
> This could be done on the client side (cfbot).  One advantage of doing
> it here is that fewer docs are uploaded - many patches won't upload docs
> at all.
> 
> https://www.postgresql.org/message-id/flat/20220409021853.gp24...@telsasoft.com
> https://www.postgresql.org/message-id/CAB8KJ=i4qmeuopq+pcsmbzgd4o-xv0fcnc+q1x7hn9hsdvk...@mail.gmail.com
> 
> https://cirrus-ci.com/task/5396696388599808

I still think that this runs the risk of increasing space utilization and thus
increase frequency of caches/artifacts being purged.



> +  # The commit that this branch is rebased on.  There's no easy way to find 
> this.

I don't think that's true, IIRC I've complained about it before. We can do
something like:

major_num=$(grep PG_MAJORVERSION_NUM build/src/include/pg_config.h | awk '{print $3}');
echo major: $major_num;
if git rev-parse --quiet --verify origin/REL_${major_num}_STABLE > /dev/null ; then
    basebranch=origin/REL_${major_num}_STABLE;
else
    basebranch=origin/master;
fi;
echo base branch: $basebranch
base_commit=$(git merge-base HEAD $basebranch)



Greetings,

Andres Freund




Re: allow changing autovacuum_max_workers without restarting

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 14:00:00 -0500, Nathan Bossart wrote:
> On Mon, Jun 03, 2024 at 04:24:27PM -0700, Andres Freund wrote:
> > On 2024-06-03 14:28:13 -0500, Nathan Bossart wrote:
> >> On Mon, Jun 03, 2024 at 12:08:52PM -0700, Andres Freund wrote:
> >> > Why do we think that increasing the number of PGPROC slots, heavyweight 
> >> > locks
> >> > etc by 256 isn't going to cause issues?  That's not an insubstantial 
> >> > amount of
> >> > memory to dedicate to something that will practically never be used.
> >> 
> >> I personally have not observed problems with these kinds of bumps in
> >> resource usage, although I may be biased towards larger systems where it
> >> doesn't matter as much.
> > 
> > IME it matters *more* on larger systems. Or at least used to, I haven't
> > experimented with this in quite a while.
> > 
> > It's possible that we improved a bunch of things sufficiently for this to 
> > not
> > matter anymore.
> 
> I'm curious if there is something specific you would look into to verify
> this.  IIUC one concern is the lock table not fitting into L3.  Is there
> anything else?  Any particular workloads you have in mind?

That was the main thing I was thinking of.


But I think I just thought of one more: It's going to *substantially* increase
the resource usage for tap tests.  Right now Cluster.pm has
	# conservative settings to ensure we can run multiple postmasters:
	print $conf "shared_buffers = 1MB\n";
	print $conf "max_connections = 10\n";

for nodes that allow streaming.

Adding 256 extra backend slots increases the shared memory usage from ~5MB to
~18MB.


I just don't see much point in reserving 256 worker "possibilities", tbh. I
can't think of any practical system where it makes sense to use this much (nor
do I think it's going to be reasonable in the next 10 years) and it's just
going to waste memory and startup time for everyone.

Nor does it make sense to me to have the max autovac workers be independent of
max_connections.

Greetings,

Andres Freund




Re: cost delay brainstorming

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 13:50:46 -0500, Nathan Bossart wrote:
> Have we ruled out further adjustments to the cost parameters as a first
> step?

I'm not against that, but it doesn't address the issue that with the current
logic one set of values just isn't going to fit both a 60MB database that's allowed
to burst to 100 iops and a 60TB database that has multiple 1M iops NVMe drives.


That said, the fact that vacuum_cost_page_hit is 1 and vacuum_cost_page_miss
is 2 just doesn't make much sense aesthetically. There's a far bigger
multiplier in actual costs than that...



> If you are still recommending that folks raise it and never recommending
> that folks lower it, ISTM that our defaults might still not be in the right
> ballpark.  The autovacuum_vacuum_cost_delay adjustment you reference (commit
> cbccac3) is already 5 years old, so maybe it's worth another look.

Adjusting cost delay much lower doesn't make much sense imo. It's already only
2ms on a 1ms granularity variable.  We could increase the resolution, but
sleeping for much shorter often isn't that cheap (you need to set up hardware
timers all the time and due to the short time they can't be combined with
other timers) and/or barely gives time to switch to other tasks.


So we'd have to increase the cost limit.


Greetings,

Andres Freund




Re: Xact end leaves CurrentMemoryContext = TopMemoryContext

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 15:28:03 -0400, Tom Lane wrote:
> After awhile I had an epiphany: what we should do is make
> CommitTransactionCommand restore the memory context that was active
> before StartTransactionCommand.  That's what we want in every place
> that was cognizant of this issue, and it seems to be the case in every
> place that wasn't doing anything explicit about it, either.

I like it.

I wonder if there's an argument that the "persistent" TopTransactionContext should
live under a different name outside of transactions, to avoid references to it
working in a context where it's not valid?  It's probably not worth it, but I'm
not sure.


> The 0001 patch attached does that, and seems to work nicely.
> I made it implement the idea of recycling TopTransactionContext,
> too

Nice.

I think there might be some benefit to doing that for some more things,
later/separately. E.g. the allocation of TopTransactionResourceOwner shows up
in profiles for workloads with small transactions. Which got a bit worse with
17 (largely bought back in other places by the advantages of the new resowner
system).



> At this point I'd be inclined to wait for the branch to be made,
> and then apply 0001 in HEAD/v18 only and 0002 in v17 and before.
> While 0001 seems fairly straightforward, it's still a nontrivial
> change and I'm hesitant to shove it in at this late stage of the
> v17 cycle.

Seems reasonable.


Greetings,

Andres Freund




Re: IPC::Run accepts bug reports

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
> > On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
> > > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > > > The one
> > > > thing I know about that *I* think is a pretty big problem about Perl
> > > > is that IPC::Run is not really maintained.
> > >
> > > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> > > affecting PostgreSQL.  If you know of IPC::Run defects, please report 
> > > them.
> > > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work 
> > > on
> > > it before absurdity like 
> > > https://github.com/cpan-authors/IPC-Run/issues/175
> > > NetBSD-10-specific behavior coping.
> > 
> > 1) Sometimes hangs hard on windows if started processes have not been shut
> >down before script exits.  I've mostly encountered this via the 
> > buildfarm /
> >CI, so I never had a good way of narrowing this down. It's very painful
> >because things seem to often just get stuck once that happens.
> 
> That's bad.  Do you have a link to a log, a thread discussing it, or even just
> one of the test names experiencing it?

I'm unfortunately blanking on the right keyword right now.

I think it basically required not shutting down a process started in the
background with IPC::Run.

I'll try to repro it by removing some ->finish or ->quit calls.

There's also a bunch of tests that have blocks like

# some Windows Perls at least don't like IPC::Run's start/kill_kill 
regime.
skip "Test fails on Windows perl", 2 if $Config{osname} eq 'MSWin32';

Some of them may have been related to this.


> > 2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
> >Broken pipe:" (in _do_filters()). There's plenty reports of this on the
> >list, and I've hit this several times personally. It seems to be timing
> >dependent, I've encountered it after seemingly irrelevant ordering 
> > changes.
> > 
> >I suspect I could create a reproducer with a bit of time.
> 
> I've seen that one.  If the harness has data to write to a child, the child
> exiting before the write is one way to reach that.  Perhaps before exec(),
> IPC::Run should do a non-blocking write from each pending IO.  That way, small
> writes never experience the timing-dependent behavior.

I think the question is rather: why is IPC::Run choosing to die in this
situation, and can that be fixed?


> > 3) It's very slow on windows (in addition to the windows process
> >slowness). That got to be fixable to some degree.
> 
> Agreed.  For the next release, today's git has some optimizations.  There are
> other known-possible Windows optimizations not implemented.

Yay!

Greetings,

Andres Freund




Re: Meson far from ready on Windows

2024-06-18 Thread Andres Freund
Hi,


On 2024-06-18 15:54:27 +0100, Dave Page wrote:
> On Tue, 18 Jun 2024 at 15:38, Andres Freund  wrote:
> > Do you have logs for those failures?
> >
>
> Sure - https://developer.pgadmin.org/~dpage/build-logs.zip. Those are all
> without any modifications to %LIB% or %INCLUDE%.

Thanks.


> > I think it's important to note that Meson largely seems to want to use
> > > pkgconfig and cmake to find dependencies. pkgconfig isn't really a thing
> > on
> > > Windows (it is available, but isn't commonly used), and even cmake would
> > > typically rely on finding things in either known installation directories
> > > or through lib/include vars.
> >
> > I am not really following what you mean with the cmake bit here?
> >
> > You can configure additional places to search for cmake files with
> > meson setup --cmake-prefix-path=...
> >
>
> None of the dependencies include cmake files for distribution on Windows,
> so there are no additional files to tell meson to search for. The same
> applies to pkgconfig files, which is why the EDB team had to manually craft
> them.

Many of them do include at least cmake files on windows if you build them
though?


Btw, I've been working with Bilal to add a few of the dependencies to the CI
images so we can test those automatically. Using vcpkg. We got that nearly
working, but he's on vacation this week...  That does ensure both cmake and
.pc files are generated, fwiw.

Currently builds gettext, icu, libxml2, libxslt, lz4, openssl, pkgconf,
python3, tcl, zlib, zstd.



I'm *NOT* sure that vcpkg is the way to go, fwiw. It does seem advantageous to
use one of the toolkits that's commonly used for building dependencies on
windows, which seems to mean vcpkg or conan.


> And that's why we really need to be able to locate headers and libraries
> easily by passing paths to meson, as we can't rely on pkgconfig, cmake, or
> things being in some standard directory on Windows.

Except that that often causes hard to diagnose breakages, because that doesn't
allow including the necessary compiler/linker flags [2].  It's a bad model, we 
shouldn't
perpetuate it.  If we want to forever make windows a complicated annoying
stepchild, that's the way to go.

FWIW, at least libzstd, libxml [3], lz4, zlib can generate cmake dependency
files on windows in their upstream code.

I'm *not* against adding "hardcoded" dependency lookup stuff for libraries
where other approaches aren't feasible, I just don't think it's a good idea to
add fragile stuff that will barely be tested, when not necessary.

Greetings,

Andres Freund


[1] Here's a build of PG with the dependencies installed, builds
https://cirrus-ci.com/task/4953968097361920

[2] E.g.

https://github.com/postgres/postgres/blob/REL_16_STABLE/src/tools/msvc/Mkvcbuild.pm#L600

https://github.com/postgres/postgres/blob/REL_16_STABLE/src/tools/msvc/Solution.pm#L1039

[3] Actually, at least your libxml build *did* include both .pc and
cmake files. So just pointing to the relevant path would do the trick.




Re: consider -Wmissing-variable-declarations

2024-06-18 Thread Andres Freund
Hi,

+many for doing this in principle


> -const char *EAN13_range[][2] = {
> +static const char *EAN13_range[][2] = {
>   {"000", "019"}, /* GS1 US */
>   {"020", "029"}, /* Restricted distribution (MO 
> defined) */
>   {"030", "039"}, /* GS1 US */

> -const char *ISBN_range[][2] = {
> +static const char *ISBN_range[][2] = {
>   {"0-00", "0-19"},
>   {"0-200", "0-699"},
>   {"0-7000", "0-8499"},
> @@ -967,7 +967,7 @@ const char *ISBN_range[][2] = {
>   */

I think these actually ought to be "static const char *const" - right now the
table is mutable, unless this day ends in *day and I've confused myself with C
syntax again.
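
To illustrate the distinction (made-up entries, not the real tables):

/* element pointers are writable at runtime */
static const char *ranges_mutable[][2] = {
	{"000", "019"},
};

/* fully const: the pointers themselves are read-only too */
static const char *const ranges_const[][2] = {
	{"000", "019"},
};

void
demo(void)
{
	(void) ranges_const;
	ranges_mutable[0][0] = "overwritten";	/* compiles fine */
#if 0
	ranges_const[0][0] = "overwritten";		/* error: assignment of read-only location */
#endif
}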




>  /* Hook to check passwords in CreateRole() and AlterRole() */
>  check_password_hook_type check_password_hook = NULL;
> diff --git a/src/backend/postmaster/launch_backend.c 
> b/src/backend/postmaster/launch_backend.c
> index bdfa238e4fe..bb1b0ac2b9c 100644
> --- a/src/backend/postmaster/launch_backend.c
> +++ b/src/backend/postmaster/launch_backend.c
> @@ -176,7 +176,7 @@ typedef struct
>   boolshmem_attach;
>  } child_process_kind;
>  
> -child_process_kind child_process_kinds[] = {
> +static child_process_kind child_process_kinds[] = {
>   [B_INVALID] = {"invalid", NULL, false},
>  
>   [B_BACKEND] = {"backend", BackendMain, true},

This really ought to be const as well and is new.  Unless somebody protests
I'm going to make it so soon.


Structs like these, containing pointers, make for nice helpers in
exploitation. We shouldn't make it easier by unnecessarily making them
mutable.


> diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c 
> b/src/bin/pg_archivecleanup/pg_archivecleanup.c
> index 07bf356b70c..5a124385b7c 100644
> --- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
> +++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
> @@ -19,17 +19,18 @@
>  #include "common/logging.h"
>  #include "getopt_long.h"
>  
> -const char *progname;
> +static const char *progname;

Hm, this one I'm not so sure about. The backend version is explicitly globally
visible, and I don't see why we shouldn't do the same for other binaries.



> From d89312042eb76c879d699380a5e2ed0bc7956605 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Sun, 16 Jun 2024 23:52:06 +0200
> Subject: [PATCH v2 05/10] Fix warnings from -Wmissing-variable-declarations
>  under EXEC_BACKEND
> 
> The NON_EXEC_STATIC variables need a suitable declaration in a header
> file under EXEC_BACKEND.
> 
> Also fix the inconsistent application of the volatile qualifier for
> PMSignalState, which was revealed by this change.

I'm very very unenthused about adding volatile to more places. It's rarely
correct and often slow. But I guess this doesn't really make it any worse.


> +#ifdef TRACE_SYNCSCAN
> +#include "access/syncscan.h"
> +#endif

I'd just include it unconditionally.




> From f99c8712ff3dc2156c3e437cfa14f1f1a7f09079 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Wed, 8 May 2024 13:49:37 +0200
> Subject: [PATCH v2 09/10] Fix -Wmissing-variable-declarations warnings for
>  float.c special case
> 
> Discussion: 
> https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c9...@eisentraut.org
> ---
>  src/backend/utils/adt/float.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
> index cbbb8aecafc..bf047ee1b4c 100644
> --- a/src/backend/utils/adt/float.c
> +++ b/src/backend/utils/adt/float.c
> @@ -56,6 +56,11 @@ static float8 cot_45 = 0;
>   * compiler to know that, else it might try to precompute expressions
>   * involving them.  See comments for init_degree_constants().
>   */
> +extern float8 degree_c_thirty;
> +extern float8 degree_c_forty_five;
> +extern float8 degree_c_sixty;
> +extern float8 degree_c_one_half;
> +extern float8 degree_c_one;
>  float8   degree_c_thirty = 30.0;
>  float8   degree_c_forty_five = 45.0;
>  float8   degree_c_sixty = 60.0;

Yikes, this is bad code. Relying on extern to have effects like this will just
break with lto. But not the responsibility of this patch.


> From 649e8086df1f175e843b26cad41a698c8c074c09 Mon Sep 17 00:00:00 2001
> From: Peter Eisentraut 
> Date: Wed, 8 May 2024 13:49:37 +0200
> Subject: [PATCH v2 10/10] Fix -Wmissing-variable-declarations warnings in
>  bison code
> 
> Add extern declarations for global variables produced by bison.

:(

Greetings,

Andres Freund




Re: Meson far from ready on Windows

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 14:53:53 +0100, Dave Page wrote:
> My next task was to extend that to support PostgreSQL 17 and beyond, which
> is where I started to run into problems. I've attempted builds using Meson
> with each of the dependencies defined in the old-style config.pl, both with
> and without modifying the INCLUDE/LIBS envvars to include the directories
> for the dependencies (as was found to work in the previous discussion re
> zlib):
> 
> Will not successfully configure at all:

Do you have logs for those failures?


> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dgssapi=enabled
> build-gssapi

> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dicu=enabled
> build-icu
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dlibxml=enabled
> build-libxml
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dlz4=enabled
> build-lz4
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dnls=enabled
> build-nls
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Duuid=ossp
> build-uuid
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dzstd=enabled
> build-zstd
> 
> Configured with modified LIBS/INCLUDE:
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dlibxslt=enabled
> build-libxslt
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dssl=openssl
> build-openssl
> 
> meson setup --auto-features=disabled
> -Dextra_include_dirs=C:\build64\include
> -Dextra_lib_dirs=C:\build64\lib;C:\build64\lib64 --wipe -Dzlib=enabled
> build-zlib
> 
> I think it's important to note that Meson largely seems to want to use
> pkgconfig and cmake to find dependencies. pkgconfig isn't really a thing on
> Windows (it is available, but isn't commonly used), and even cmake would
> typically rely on finding things in either known installation directories
> or through lib/include vars.

I am not really following what you mean with the cmake bit here?

You can configure additional places to search for cmake files with
meson setup --cmake-prefix-path=...


> There really aren't standard directories like
> /usr/lib or /usr/include as we find on unixes, or pkgconfig files for
> everything.

Yes, and?

Greetings,

Andres Freund




Re: Truncation of mapped catalogs (whether local or shared) leads to server crash

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 19:58:26 +0530, Ashutosh Sharma wrote:
> On Tue, Jun 18, 2024 at 7:50 PM Tom Lane  wrote:
> >
> > Robert Haas  writes:
> > > On Tue, Jun 18, 2024 at 8:10 AM Ashutosh Sharma  
> > > wrote:
> > >> Executing below commands:
> > >> -- set allow_system_table_mods TO on;
> > >> -- truncate table pg_type;
> >
> > > If the operation isn't allowed without turning on
> > > allow_system_table_mods, that means that doing it is probably a bad
> > > idea and will probably break stuff, as happened here.
> >
> > Nothing good can come of truncating *any* core system catalog --- what
> > do you think you'll still be able to do afterwards?
> >
> > I think the assertion you noticed is there because the code path gets
> > traversed during REINDEX, which is an operation we do support on
> > system catalogs.  I have zero interest in making truncate work
> > on them.
> >
> 
> I agree with you on that point. How about considering a complete
> restriction instead?

What's the point?  There are occasional cases where doing something dangerous
is useful, for debugging or corruption recovery. If we flat out prohibit this
we'll just need an allow_system_table_mods_but_for_real option.

Greetings,

Andres Freund




Re: CompilerWarnings task does not catch C++ warnings

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-18 09:27:02 +0200, Peter Eisentraut wrote:
> The CompilerWarnings task on Cirrus CI does not catch warnings in C++ code.
> It tries to make warnings fatal by passing COPT='-Werror', but that does not
> apply to C++ compilations.
> 
> I suggest that we just add COPT to CXXFLAGS as well.  I think passing
> -Werror is just about the only reasonable use of COPT nowadays, so making
> that more robust seems useful.  I don't think there is a need for a separate
> make variable for C++ here.

+1

Greetings,

Andres Freund




Re: may be a buffer overflow problem

2024-06-18 Thread Andres Freund
Hi,

On 2024-06-17 22:42:41 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2024-06-17 23:52:54 +0200, Daniel Gustafsson wrote:
> >> Since sqlca is, according to our docs, present in other database systems we
> >> should probably keep it a 5-char array for portability reasons.  Adding a
> >> padding character should be fine though.
> 
> > How about, additionally, adding __attribute__((nonstring))? Wrapped in an
> > attribute, of course.  That'll trigger warning for many unsafe uses, like
> > strlen().
> 
> What I was advocating for is that we make it *safe* for strlen, not
> that we double down on awkward, non-idiomatic, unsafe coding
> practices.

Given that apparently other platforms have it as a no-trailing-zero-byte
"string", I'm not sure that that is that clearly a win. Also, if they just
copy the field onto the stack or such, they'll have the same issue again.

And then there is this:

> Admittedly, I'm not sure how we could persuade compilers that
> a char[5] followed by a char field is a normal C string ...

I think the explicit backstop of a zero byte is a good idea, but I don't think
we'd just want to rely on it.

Greetings,

Andres Freund




Re: may be a buffer overflow problem

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 23:52:54 +0200, Daniel Gustafsson wrote:
> Since sqlca is, according to our docs, present in other database systems we
> should probably keep it a 5-char array for portability reasons.  Adding a
> padding character should be fine though.

How about, additionally, adding __attribute__((nonstring))? Wrapped in an
attribute, of course.  That'll trigger warning for many unsafe uses, like
strlen().
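
A sketch of what that wrapping could look like (the macro name is invented,
nothing like it exists in c.h today, and the struct is just a trimmed-down
stand-in for sqlca_t):

#include <string.h>

/* nonstring is available from GCC 8 on; expand to nothing elsewhere */
#if defined(__GNUC__) && __GNUC__ >= 8
#define pg_attribute_nonstring __attribute__((nonstring))
#else
#define pg_attribute_nonstring
#endif

struct sqlca_sketch
{
	char		sqlcaid[8];
	char		sqlstate[5] pg_attribute_nonstring;	/* not NUL-terminated */
};

size_t
bad_use(const struct sqlca_sketch *s)
{
	return strlen(s->sqlstate);	/* per the above, gcc warns about this use */
}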

It doesn't quite detect the problematic case in ecpg_log() though; it seems it
doesn't understand fprintf() well enough (it does trigger in simple printf()
cases, because they get reduced to puts(), which it understands).


Adding nonstring would possibly allow us to re-enable -Wstringop-truncation,
which triggers a bunch on:

../../../../../home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c: 
In function ‘ECPGset_var’:
../../../../../home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c:575:17:
 warning: ‘__builtin_strncpy’ output truncated before terminating nul copying 5 
bytes from a string of the same length [-Wstringop-truncation]
  575 | strncpy(sqlca->sqlstate, "YE001", 
sizeof(sqlca->sqlstate));


The only other -Wstringop-truncation warnings are in ecpg tests and at least
the first one doesn't look bogus:

../../../../../home/andres/src/postgresql/src/interfaces/ecpg/test/compat_oracle/char_array.pgc:
 In function 'main':
../../../../../home/andres/src/postgresql/src/interfaces/ecpg/test/compat_oracle/char_array.pgc:54:5:
 warning: '__builtin_strncpy' output truncated before terminating nul copying 5 
bytes from a string of the same length [-Wstringop-truncation]
   54 | strncpy(shortstr, p, sizeof shortstr);

Which seems like a valid complaint, given that shortstr is a char[5], p
is "X" and thatshortstr is printed:
printf("\"%s\": \"%s\"  %d\n", bigstr, shortstr, shstr_ind);
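
The pattern reduces to something like this (simplified, not the actual test
code); the explicit NUL backstop added at the end is one way to make the copy
safe:

#include <stdio.h>
#include <string.h>

int
main(void)
{
	char		shortstr[5];
	const char *p = "XXXXX";	/* five chars, so strncpy copies no NUL */

	strncpy(shortstr, p, sizeof shortstr);
	shortstr[sizeof shortstr - 1] = '\0';	/* without this, the printf below
											 * reads past the buffer */
	printf("\"%s\"\n", shortstr);
	return 0;
}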


Greetings,

Andres Freund




Re: Inval reliability, especially for inplace updates

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > https://postgr.es/m/20240512232923.aa.nmi...@google.com wrote:
> > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > >
> > > > - Inplace update uses transactional CacheInvalidateHeapTuple().  
> > > > ROLLBACK of
> > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t 
> > > > loss
> > > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this 
> > > > right.
> > >
> > > I plan to fix that like CacheInvalidateRelmap(): send the inval 
> > > immediately,
> > > inside the critical section.  Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Have you considered instead submitting these invalidations during abort as
well?


> > > a. Within logical decoding, cease processing invalidations for inplace
> >
> > I'm attaching the implementation.  This applies atop the v3 patch stack from
> > https://postgr.es/m/20240614003549.c2.nmi...@google.com, but the threads are
> > mostly orthogonal and intended for independent review.  Translating a tuple
> > into inval messages uses more infrastructure than relmapper, which needs 
> > just
> > a database ID.  Hence, this ended up more like a miniature of inval.c's
> > participation in the transaction commit sequence.
> >
> > I waffled on whether to back-patch inplace150-inval-durability-atcommit
>
> That inplace150 patch turned out to be unnecessary.  Contrary to the
> "noncritical resource releasing" comment some lines above
> AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
> abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

I'm not sure it holds true even today - what if the transaction didn't have an
xid? Then RecordTransactionAbort() wouldn't trigger
 "cannot abort transaction %u, it was already committed"
I think?



> > - Same change, no WAL version bump.  Standby must update before primary.  
> > This
> >   is best long-term, but the transition is more disruptive.  I'm leaning
> >   toward this one, but the second option isn't bad:

Hm. The inplace record doesn't use the length of the "main data" record
segment for anything, from what I can tell. If records by an updated primary
were replayed by an old standby, it'd just ignore the additional data, afaict?

I think with the code as-is, the situation with an updated standby replaying
an old primary's record would actually be worse - it'd afaict just assume the
now-longer record contained valid fields, despite those just pointing into
uninitialized memory.  I think the replay routine would have to check the
length of the main data and execute the invalidation conditionally.
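
A rough sketch of such a length check (the extended-record size and the inval
step are assumptions about a possible new format, not existing code):

#include "postgres.h"

#include "access/heapam_xlog.h"
#include "access/xlogreader.h"

/* assumed: main-data size once inval information is appended */
#define SIZE_OF_INPLACE_WITH_INVALS	(SizeOfHeapInplace + sizeof(int))

static void
heap_xlog_inplace_sketch(XLogReaderState *record)
{
	xl_heap_inplace *xlrec = (xl_heap_inplace *) XLogRecGetData(record);

	(void) xlrec;				/* apply the in-place tuple change as today */

	/*
	 * Only a record written by an updated primary carries the invalidations;
	 * an old primary's shorter record must not be read past its end.
	 */
	if (XLogRecGetDataLen(record) >= SIZE_OF_INPLACE_WITH_INVALS)
	{
		/* process the appended invalidation messages here */
	}
}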


> > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
> >   every backend.  This is more wasteful, but inplace updates might be rare
> >   enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
   (i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
   over and over.

So I suspect this might not be too bad, compared to the current badness.


At least for core code. I guess there could be extension code triggering
inplace updates more frequently? But I'd hope they'd do it not on catalog
tables... Except that we wouldn't know that that's the case during replay,
it's not contained in the record.




> > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
> >   correct if one ends recovery between the two records, but you'd need to be
> >   unlucky to notice.  Noticing would need a procedure like the following.  A
> >   hot standby backend populates a relcache entry, then does DDL on the rel
> >   after recovery ends.

Hm. The problematic cases presumably involve an access exclusive lock? If so,
could we do LogStandbyInvalidations() *before* logging the WAL record for the
inplace update? The invalidations can't be processed by other backends until
the exclusive lock has been released, which should avoid the race?


Greetings,

Andres Freund




Re: Xact end leaves CurrentMemoryContext = TopMemoryContext

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 16:37:05 -0400, Tom Lane wrote:
> As to what to do about it: I'm imagining that instead of resetting
> CurrentMemoryContext to TopMemoryContext, we set it to some special
> context that we expect we can reset every so often, like at the start
> of the next transaction.  The existing TransactionAbortContext is
> a very similar thing, and maybe could be repurposed/shared with this
> usage.

One issue is that that could lead to hard-to-find use-after-free issues in
currently working code. Right now allocations made "between transactions"
live forever; if we just use a different context they won't anymore.

Particularly if the reset is only occasional, we'd make it hard to find
buggy allocations.

I wonder if we ought to set CurrentMemoryContext to NULL in that timeframe,
forcing code to explicitly choose what lifetime is needed, rather than just
defaulting such code into changed semantics.

Greetings,

Andres Freund




Re: cost delay brainstorming

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 15:39:27 -0400, Robert Haas wrote:
> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
> problem with autovacuum as it exists today is that the cost delay is
> sometimes too low to keep up with the amount of vacuuming that needs
> to be done.

I agree it's a big problem, not sure it's *the* problem. But I'm happy to see
it improved anyway, so it doesn't really matter.

One issue around all of this is that we pretty much don't have the tools to
analyze autovacuum behaviour across a larger number of systems in a realistic
way :/.  I find my own view of what precisely the problem is being heavily
swayed by the last few problematic cases I've looked at.



> I think we might able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.

Another issue is that it's easy to fall behind due to cost limits on systems
where autovacuum_max_workers is smaller than the number of busy tables.

IME one common situation is to have a single table that's being vacuumed too
slowly due to cost limits, with everything else keeping up easily.


> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.

I have seen disabling cost limits cause replication setups to fall over
because the amount of WAL increases beyond what can be
replicated/archived/replayed.  It's very easy to reach the issue when syncrep
is enabled.



> 1a. Incidentally, when I have seen problems because of vacuum running
> "too fast", it's not been because it was using up too much I/O
> bandwidth, but because it's pushed too much data out of cache too
> quickly. A long overnight vacuum can evict a lot of pages from the
> system page cache by morning - the ring buffer only protects our
> shared_buffers, not the OS cache. I don't think this can be fixed by
> rate-limiting vacuum, though: to keep the cache eviction at a level
> low enough that you could be certain of not causing trouble, you'd
> have to limit it to an extremely low rate which would just cause
> vacuuming not to keep up. The cure would be worse than the disease at
> that point.

I've seen versions of this too. Ironically it's often made way worse by
ringbuffers, because even if there is space in shared buffers, we'll not move
buffers there, instead putting a lot of pressure on the OS page cache.


> If you like this idea, I'd like to know that, and hear any further
> thoughts you have about how to improve or refine it. If you don't, I'd
> like to know that, too, and any alternatives you can propose,
> especially alternatives that don't require crazy amounts of new
> infrastructure to implement.

I unfortunately don't know what to do about all of this with just the existing
set of metrics :/.


One reason that cost limit can be important is that we often schedule
autovacuums on tables in useless ways, over and over. Without cost limits
providing *some* protection against using up all IO bandwidth / WAL volume, we
could end up doing even worse.


Common causes of such useless vacuums I've seen:

- Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
  over and over on the same relation, using up the whole cost budget. This is
  particularly bad because often we'll not autovacuum anything else, building
  up a larger and larger backlog of actual work.

- Tables, where on-access pruning works very well, end up being vacuumed far
  too frequently, because our autovacuum scheduling doesn't know about tuples
  having been cleaned up by on-access pruning.

- Larger tables with occasional lock conflicts cause autovacuum to be
  cancelled and restarted from scratch over and over. If that happens before
  the second table scan, this can easily eat up the whole cost budget without
  making forward progress.


Greetings,

Andres Freund




Re: FYI: LLVM Runtime Crash

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 16:07:49 -0400, David E. Wheeler wrote:
> On Jun 17, 2024, at 11:52, David E. Wheeler  wrote:
> 
> > I don’t *think* it’s something that can be fixed in Postgres core. This is 
> > mostly in FYI in case anyone else runs into this issue or knows something 
> > more about it.
> 
> Okay, a response to the issue[1] says the bug is in Postgres:
>
> > The error message is LLVM reporting the backend can't handle the particular 
> > form of "probe-stack" attribute in the input LLVM IR. So this is likely a 
> > bug in the way postgres is generating LLVM IR: please file a bug against 
> > Postgres. (Feel free to reopen if you have some reason to believe the issue 
> > is on the LLVM side.)

I suspect the issue might be that the versions of clang and LLVM are diverging
too far. Does it work if you pass CLANG=/opt/homebrew/opt/llvm/bin/clang to
configure?

Greetings,

Andres Freund




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-17 Thread Andres Freund
Hi,

This thread was referenced by 
https://www.postgresql.org/message-id/48F0A1F8-E0B4-41F8-990F-41E6BA2A6185%40yesql.se

On 2024-06-13 14:34:27 +0800, Erica Zhang wrote:

> diff --git a/src/backend/libpq/be-secure-openssl.c 
> b/src/backend/libpq/be-secure-openssl.c
> index 39b1a66236..d097e81407 100644
> --- a/src/backend/libpq/be-secure-openssl.c
> +++ b/src/backend/libpq/be-secure-openssl.c
> @@ -1402,30 +1402,30 @@ static bool
>  initialize_ecdh(SSL_CTX *context, bool isServerStart)
>  {
>  #ifndef OPENSSL_NO_ECDH
> - EC_KEY *ecdh;
> - int nid;
> + char*curve_list = strdup(SSLECDHCurve);

ISTM we'd want to eventually rename the GUC variable to indicate it's a list?
I think the "ecdh" portion is actually not accurate anymore either; it's used
outside of ecdh if I understand correctly (though perhaps I don't)?


> + char*saveptr;
> + char*token = strtok_r(curve_list, ":", &saveptr);
> + int nid;
>  
> - nid = OBJ_sn2nid(SSLECDHCurve);
> - if (!nid)
> + while (token != NULL)

It'd be good to have a comment explaining why we're parsing the list ourselves
instead of using just the builtin SSL_CTX_set1_groups_list().

>   {
> - ereport(isServerStart ? FATAL : LOG,
> + nid = OBJ_sn2nid(token);
> + if (!nid)
> + {ereport(isServerStart ? FATAL : LOG,
>   (errcode(ERRCODE_CONFIG_FILE_ERROR),
> -  errmsg("ECDH: unrecognized curve name: %s", 
> SSLECDHCurve)));
> +  errmsg("ECDH: unrecognized curve name: %s", 
> token)));
>   return false;
> + }
> + token = strtok_r(NULL, ":", &saveptr);
>   }
>  
> - ecdh = EC_KEY_new_by_curve_name(nid);
> - if (!ecdh)
> + if(SSL_CTX_set1_groups_list(context, SSLECDHCurve) !=1)
>   {
>   ereport(isServerStart ? FATAL : LOG,
>   (errcode(ERRCODE_CONFIG_FILE_ERROR),
> -  errmsg("ECDH: could not create key")));
> +  errmsg("ECDH: failed to set curve names")));

Probably worth including the value of the GUC here?



This also needs to change the documentation for the GUC.



Once we have this parameter we probably should add at least x25519 to the
allowed list, as that's the client side default these days.

But that can be done in a separate patch.

Greetings,

Andres Freund




Re: Direct SSL connection and ALPN loose ends

2024-06-17 Thread Andres Freund
Hi,

On 2024-04-29 18:24:04 +0300, Heikki Linnakangas wrote:
> All reported bugs have now been fixed. We now enforce ALPN in all the right
> places. Please let me know if I missed something.

Very minor and not really your responsibility:

If provided with the necessary key information, wireshark can decode TLS
exchanges when using sslnegotiation=postgres but not with direct. Presumably
it needs to be taught postgres' ALPN id or something.

Example with direct:

  476 6513.310308457 192.168.0.113 → 192.168.0.200 48978 5432 142 TLSv1.3 
Finished
  477 6513.310341492 192.168.0.113 → 192.168.0.200 48978 5432 151 TLSv1.3 
Application Data
  478 6513.320730295 192.168.0.200 → 192.168.0.113 5432 48978 147 TLSv1.3 New 
Session Ticket
  479 6513.320745684 192.168.0.200 → 192.168.0.113 5432 48978 147 TLSv1.3 New 
Session Ticket
  480 6513.321175713 192.168.0.113 → 192.168.0.200 48978 5432 68 TCP 48978 → 
5432 [ACK] Seq=915 Ack=1665 Win=62848 Len=0 TSval=3779915421 TSecr=3469016093
  481 6513.323161553 192.168.0.200 → 192.168.0.113 5432 48978 518 TLSv1.3 
Application Data
  482 6513.323626180 192.168.0.113 → 192.168.0.200 48978 5432 125 TLSv1.3 
Application Data
  483 6513.333977769 192.168.0.200 → 192.168.0.113 5432 48978 273 TLSv1.3 
Application Data
  484 6513.334581920 192.168.0.113 → 192.168.0.200 48978 5432 95 TLSv1.3 
Application Data
  485 6513.334666116 192.168.0.113 → 192.168.0.200 48978 5432 92 TLSv1.3 Alert 
(Level: Warning, Description: Close Notify)

Example with postgres:

  502 6544.752799560 192.168.0.113 → 192.168.0.200 46300 5432 142 TLSv1.3 
Finished
  503 6544.752842863 192.168.0.113 → 192.168.0.200 46300 5432 151 PGSQL >?
  504 6544.76315 192.168.0.200 → 192.168.0.113 5432 46300 147 TLSv1.3 New 
Session Ticket
  505 6544.763163155 192.168.0.200 → 192.168.0.113 5432 46300 147 TLSv1.3 New 
Session Ticket
  506 6544.763587595 192.168.0.113 → 192.168.0.200 46300 5432 68 TCP 46300 → 
5432 [ACK] Seq=923 Ack=1666 Win=62848 Len=0 TSval=3779946864 TSecr=3469047536
  507 6544.765024827 192.168.0.200 → 192.168.0.113 5432 46300 518 PGSQL 
Q
  509 6544.776974164 192.168.0.200 → 192.168.0.113 5432 46300 273 PGSQL 
X
  511 6544.777631520 192.168.0.113 → 192.168.0.200 46300 5432 92 TLSv1.3 Alert 
(Level: Warning, Description: Close Notify)

Note that in the second one it knows what's inside the "Application Data"
messages and decodes them (S: startup, authentication ok, parameters, cancel 
key,
ready for query, C: simple query, S: description, 10 rows, command complete,
ready for query).

In the GUI you can obviously go into the "postgres messages" in more detail
than I know how to do on the console.



A second aspect is that I'm not super happy about the hack of stashing data
into Port.  I think medium term we'd be better off separating out the
buffering for unencrypted and encrypted data properly. It turns out that not
having any buffering *below* openssl (i.e. the encrypted data) hurts both for
the send and receive side, due to a) increased number of syscalls b) too many
small packets being sent, as we use TCP_NODELAY c) kernel memory copies being
slower due to the small increments.
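
To illustrate what I mean by buffering below openssl, here's a minimal sketch
using a stock buffering BIO between the SSL object and the socket. It ignores
non-blocking I/O, error handling and our existing custom BIO entirely, so it's
just the shape of the idea, not a proposal:

/*
 * Sketch only: push a buffering BIO between SSL and the socket, so that
 * several small TLS records get coalesced into fewer write() calls.
 * Assumes a blocking socket.
 */
#include <openssl/ssl.h>
#include <openssl/bio.h>

static void
attach_buffered_socket(SSL *ssl, int sock)
{
    BIO *sock_bio = BIO_new_socket(sock, BIO_NOCLOSE);
    BIO *buf_bio = BIO_new(BIO_f_buffer());

    /* chain: SSL -> buffering BIO -> socket BIO */
    BIO_push(buf_bio, sock_bio);
    SSL_set_bio(ssl, buf_bio, buf_bio);
}

/* after writing a batch of data, the buffer needs an explicit flush: */
/*     BIO_flush(SSL_get_wbio(ssl)); */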

Greetings,

Andres Freund




Re: IPC::Run accepts bug reports

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
> Separating this from the pytest thread:
>
> On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > The one
> > thing I know about that *I* think is a pretty big problem about Perl
> > is that IPC::Run is not really maintained.
>
> I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> NetBSD-10-specific behavior coping.

1) Sometimes hangs hard on windows if started processes have not been shut
   down before script exits.  I've mostly encountered this via the buildfarm /
   CI, so I never had a good way of narrowing this down. It's very painful
   because things seem to often just get stuck once that happens.

2) If a subprocess dies at an inopportune moment, IPC::Run dies with "ack
   Broken pipe:" (in _do_filters()). There are plenty of reports of this on the
   list, and I've hit this several times personally. It seems to be timing
   dependent; I've encountered it after seemingly irrelevant ordering changes.

   I suspect I could create a reproducer with a bit of time.

3) It's very slow on windows (in addition to the windows process
   slowness). That's got to be fixable to some degree.

Greetings,

Andres Freund




Re: ecdh support causes unnecessary roundtrips

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 19:51:45 +0200, Daniel Gustafsson wrote:
> > On 17 Jun 2024, at 19:44, Andres Freund  wrote:
> 
> >> Let's bring that to Erica's patch for allowing a list of curves.
> > 
> > I'm kinda wondering if we ought to do something about this in the
> > backbranches. Forcing unnecessary roundtrips onto everyone for the next five
> > years due to an oversight on our part isn't great.  Once you're not local, 
> > the
> > roundtrip does measurably increase the "time to first query".
> 
> I don't disagree, but wouldn't it be the type of behavioural change which we
> typically try to avoid in backbranches?

Yea, it's not great. Not sure what the right thing is here.


> Changing the default of the ecdh GUC would perhaps be doable?

I was wondering whether we could change the default so that it accepts both
x25519 and secp256r1. Unfortunately that seems to require changing what we
use to set the parameter...


> (assuming that's a working solution to avoid the roundtrip).

It is.


> Amending the documentation is the one thing we certainly can do but 99.9% of
> affected users won't know they are affected so won't look for that section.

Yea. It's also possible that some other bindings changed their default to
match ours...

Greetings,

Andres Freund




Re: ecdh support causes unnecessary roundtrips

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 19:29:47 +0200, Daniel Gustafsson wrote:
> >> I wonder if that made OpenSSL override the min protocol version and switch
> >> to a TLS1.3 ClientHello since it otherwise couldn't announce the curve.
> >
> > The client seems to announce the curve in the initial ClientHello even with
> > 1.3 as the minimum version.
>
> With 1.3 it should announce it in ClientHello, do you mean that it's announced
> when 1.2 is the minimum version as well?  It does make sense since a 1.2 
> server
> is defined to disregard all extensions.

Yes, it's announced even when 1.2 is the minimum:

Extension: supported_versions (len=5) TLS 1.3, TLS 1.2
Type: supported_versions (43)
Length: 5
Supported Versions length: 4
Supported Version: TLS 1.3 (0x0304)
Supported Version: TLS 1.2 (0x0303)
...
Extension: key_share (len=38) x25519
Type: key_share (51)
Length: 38
Key Share extension



> Let's bring that to Erica's patch for allowing a list of curves.

I'm kinda wondering if we ought to do something about this in the
backbranches. Forcing unnecessary roundtrips onto everyone for the next five
years due to an oversight on our part isn't great.  Once you're not local, the
roundtrip does measurably increase the "time to first query".

Greetings,

Andres Freund




tls 1.3: sending multiple tickets

2024-06-17 Thread Andres Freund
Hi,

To investigate an unrelated issue, I set up key logging in the backend (we
probably should build that in) and looked at the decrypted data.  And noticed
that just after TLS setup finished the server sends three packets in a row:

C->S: TLSv1.3: finished
C->S: TLSv1.3: application data (startup message)
S->C: TLSv1.3: new session ticket
S->C: TLSv1.3: new session ticket
S->C: TLSv1.3: application data (authentication ok, parameter status+)


We try to turn off session resumption, but not completely enough for 1.3:
   SSL_OP_NO_TICKET
       SSL/TLS supports two mechanisms for resuming sessions: session ids and
       stateless session tickets.

       When using session ids a copy of the session information is cached on
       the server and a unique id is sent to the client. When the client
       wishes to resume it provides the unique id so that the server can
       retrieve the session information from its cache.

       When using stateless session tickets the server uses a session ticket
       encryption key to encrypt the session information. This encrypted data
       is sent to the client as a "ticket". When the client wishes to resume
       it sends the encrypted data back to the server. The server uses its key
       to decrypt the data and resume the session. In this way the server can
       operate statelessly - no session information needs to be cached locally.

       The TLSv1.3 protocol only supports tickets and does not directly
       support session ids. However, OpenSSL allows two modes of ticket
       operation in TLSv1.3: stateful and stateless. Stateless tickets work
       the same way as in TLSv1.2 and below. Stateful tickets mimic the
       session id behaviour available in TLSv1.2 and below. The session
       information is cached on the server and the session id is wrapped up
       in a ticket and sent back to the client. When the client wishes to
       resume, it presents a ticket in the same way as for stateless tickets.
       The server can then extract the session id from the ticket and retrieve
       the session information from its cache.

       By default OpenSSL will use stateless tickets. The SSL_OP_NO_TICKET
       option will cause stateless tickets to not be issued. In TLSv1.2 and
       below this means no ticket gets sent to the client at all. In TLSv1.3
       a stateful ticket will be sent. This is a server-side option only.

       In TLSv1.3 it is possible to suppress all tickets (stateful and
       stateless) from being sent by calling SSL_CTX_set_num_tickets(3) or
       SSL_set_num_tickets(3).


Note the second to last paragraph: Because we use SSL_OP_NO_TICKET we trigger
use of stateful tickets. Which afaict are never going to be useful, because we
don't share the necessary state.

I guess openssl really could have inferred this from the fact that we *do*
call SSL_CTX_set_session_cache_mode(SSL_SESS_CACHE_OFF), but evidently it
doesn't.


Seems we ought to use SSL_CTX_set_num_tickets() to prevent issuing the useless
tickets?
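
Concretely, I'm thinking of something like the below, next to the existing
code that disables session caching / tickets (a sketch; the version guard is
an assumption about how we'd gate it, since SSL_CTX_set_num_tickets() only
exists since OpenSSL 1.1.1):

#if OPENSSL_VERSION_NUMBER >= 0x10101000L
    /* suppress TLSv1.3 tickets (stateful ones included), we can't use them */
    SSL_CTX_set_num_tickets(context, 0);
#endif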



It seems like a buglet in openssl that it forces each session ticket to be
sent in its own packet (it does an explicit BIO_flush(), so even if we
buffered between openssl and the OS, as I think we should, we'd still send it
separately), but I don't really understand most of this stuff.

Greetings,

Andres Freund




Re: ecdh support causes unnecessary roundtrips

2024-06-17 Thread Andres Freund
Hi,

On 2024-06-17 12:00:30 +0200, Daniel Gustafsson wrote:
> > On 17 Jun 2024, at 01:46, Andres Freund  wrote:
>
> > When connecting with a libpq based client, the TLS establishment ends up 
> > like
> > this in many configurations;
> >
> > C->S: TLSv1 393 Client Hello
> > S->C: TLSv1.3 167 Hello Retry Request, Change Cipher Spec
> > C->S: TLSv1.3 432 Change Cipher Spec, Client Hello
> > S->C: TLSv1.3 1407 Server Hello, Application Data, Application Data, 
> > Application Data, Application Data
>
> I wonder if this is the result of us still using TLS 1.2 as the default 
> minimum
> protocol version.  In 1.2 the ClientHello doesn't contain the extensions for
> cipher and curves, so the server must reply with a HRR (Hello Retry Request)
> message asking the client to set protocol, curve and cipher.  The client will
> respond with a new Client Hello using 1.3 with the extensions.

I'm pretty sure it's not that, given that either
a) removing the server side SSL_CTX_set_tmp_ecdh() call, or
b) adding a client side SSL_CTX_set_tmp_ecdh() call
avoids the roundtrip.

I've now also confirmed that ssl_min_protocol_version=TLSv1.3 doesn't change
anything relevant (of course the indicated version support changes).


> > This appears to be caused by ECDH support.  The difference between the two
> > ClientHellos is
> > -Extension: key_share (len=38) x25519
> > +Extension: key_share (len=71) secp256r1
> >
> > I.e. the clients wanted to use x25519, but the server insists on secp256r1.
>
> Somewhat related, Erica Zhang has an open patch to make the server-side curves
> configuration take a list rather than a single curve [0], and modernizing the
> API used as a side effect (SSL_CTX_set_tmp_ecdh is documented as obsoleted by
> OpenSSL, but not deprecated with an API level).

That does seem nicer. Fun coincidence in timing.


> > I don't know if it's good that we're calling SSL_CTX_set_tmp_ecdh at all,
>
> To set the specified curve in ssl_ecdh_curve we have to don't we?

Sure, but it's not obvious to me why we actually want to override openssl's
defaults here. There's not even a parameter to opt out of forcing a specific
choice on the server side.


> > but if it is, shouldn't we at least do the same in libpq, so we don't 
> > introduce
> > unnecessary roundtrips?
>
> If we don't set the curve in the client I believe OpenSSL will pass the set of
> supported curves the client has, which then should allow the server to pick 
> the
> one it wants based on ssl_ecdh_curve, so ideally we shouldn't have to I think.

Afaict the client sends exactly one:

Transport Layer Security
TLSv1.3 Record Layer: Handshake Protocol: Client Hello
Content Type: Handshake (22)
Version: TLS 1.0 (0x0301)
Length: 320
Handshake Protocol: Client Hello
...
Extension: supported_versions (len=5) TLS 1.3, TLS 1.2
Type: supported_versions (43)
Length: 5
Supported Versions length: 4
Supported Version: TLS 1.3 (0x0304)
Supported Version: TLS 1.2 (0x0303)
Extension: psk_key_exchange_modes (len=2)
Type: psk_key_exchange_modes (45)
Length: 2
PSK Key Exchange Modes Length: 1
PSK Key Exchange Mode: PSK with (EC)DHE key establishment 
(psk_dhe_ke) (1)
Extension: key_share (len=38) x25519
Type: key_share (51)
Length: 38
Key Share extension
Extension: compress_certificate (len=5)
Type: compress_certificate (27)
...

Note key_share being set to x25519.


The HRR says:

Extension: key_share (len=2) secp256r1
Type: key_share (51)
Length: 2
Key Share extension
Selected Group: secp256r1 (23)


> > I did confirm that doing the same thing on the client side removes the
> > additional roundtrip.
>
> The roundtrip went away because the client was set to use secp256r1?

Yes. Or if I change the server to not set the ecdh curve.


> I wonder if that made OpenSSL override the min protocol version and switch
> to a TLS1.3 ClientHello since it otherwise couldn't announce the curve.

The client seems to announce the curve in the initial ClientHello even with
1.3 as the minimum version.

What *does* make the HRR go away is setting ssl_max_protocol_version=TLSv1.2
on the client side.


https://wiki.openssl.org/index.php/TLS1.3 says:

> In practice most clients will use X25519 or P-256 for their initial
> key_share. For maximum performance it is recommended that servers are
> configured to support at least those two groups and clients use one of those
> two for its initial key_share. This is the default case (OpenSSL clients
> will use X25519).

We're not allowing both groups and the client defaults to X25519, hence
the HRR.
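
I.e. if the server side did something like the below (sketch only; "context"
as in the backend's TLS setup code, and SSL_CTX_set1_groups_list() requires
OpenSSL 1.1.1+), the client's default X25519 key_share would be acceptable
and no HRR would be needed:

    /* accept both common client defaults instead of forcing secp256r1 */
    if (SSL_CTX_set1_groups_list(context, "X25519:P-256") != 1)
    {
        /* report the error, as we do for the current single-curve setup */
    }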


> If you force the client min protocol to 1.3 in an unpatched client, do you
> see the same speedup?

Nope.

Greetings,

Andres Freund




Re: Using LibPq in TAP tests via FFI

2024-06-16 Thread Andres Freund
Hi,

On 2024-06-17 12:03:28 +1200, Thomas Munro wrote:
> And then maybe Macs if Tom doesn't beat me to it.

macports even has a platypus package, so that should be easy.

For CI it should suffice to add p5.34-ffi-platypus to the list of packages in
macos' setup_additional_packages_script, they then should get automatically
cached.

Greetings,

Andres Freund




ecdh support causes unnecessary roundtrips

2024-06-16 Thread Andres Freund
Hi,

When connecting with a libpq based client, the TLS establishment ends up like
this in many configurations:

C->S: TLSv1 393 Client Hello
S->C: TLSv1.3 167 Hello Retry Request, Change Cipher Spec
C->S: TLSv1.3 432 Change Cipher Spec, Client Hello
S->C: TLSv1.3 1407 Server Hello, Application Data, Application Data, 
Application Data, Application Data
...

I.e. there are two client hellos, because the server rejects the client's
"parameters".


This appears to be caused by ECDH support.  The difference between the two
ClientHellos is
-Extension: key_share (len=38) x25519
+Extension: key_share (len=71) secp256r1

I.e. the clients wanted to use x25519, but the server insists on secp256r1.


This turns out to be due to

commit 3164721462d547fa2d15e2a2f07eb086a3590fd5
Author: Peter Eisentraut 
Date:   2013-12-07 15:11:44 -0500

SSL: Support ECDH key exchange



I don't know if it's good that we're calling SSL_CTX_set_tmp_ecdh at all, but
if it is, shouldn't we at least do the same in libpq, so we don't introduce
unnecessary roundtrips?

I did confirm that doing the same thing on the client side removes the
additional roundtrip.



It seems kind of a shame that we have fewer roundtrips due to
sslnegotiation=direct, but do completely unnecessary roundtrips all the
time...


In a network with ~10ms latency I see an almost 30% increase in
connections-per-second for a single client if I avoid the added roundtrip.

I think this could almost be considered a small bug...

Greetings,

Andres Freund




Re: Using LibPq in TAP tests via FFI

2024-06-16 Thread Andres Freund
Hi,

On 2024-06-16 19:07:49 -0400, Tom Lane wrote:
> Thomas Munro  writes:
> > On Mon, Jun 17, 2024 at 10:38 AM Andres Freund  wrote:
> >> I'm somewhat doubtful we'll just agree on making FFI::Platypus a hard
> >> dependency, but if we agree to do so...
> 
> > Why can't we just do that?  I mean, do we have any concrete reason to
> > think that it'll block a supported platform?
> 
> IIUC, this would only be a hard dependency if you want to run certain
> TAP tests (maybe eventually all of them).

I think it'd be all of them within a very short timeframe. IMO we'd want to
convert a bunch of the code in Cluster.pm to use psql-less connections to
maximize the benefit across all tests, without needing to modify all of them.

Greetings,

Andres Freund




Re: Using LibPq in TAP tests via FFI

2024-06-16 Thread Andres Freund
Hi,

On 2024-06-16 17:43:05 -0400, Andrew Dunstan wrote:
> On Fri, Jun 14, 2024 at 12:25 PM Andres Freund  wrote:
> I guess it's a question of how widely available FFI::Platypus is. I know
> it's available pretty much out of the box on Strawberry Perl and Msys2'
> ucrt perl.

FWIW I hacked a bit on CI, trying to make it work. Took a bit, partially
because CI uses an older strawberry perl without FFI::Platypus. And
FFI::Platypus didn't build with that.


Updating that to 5.38 causes some complaints about LANG that I haven't hunted
down, just avoided by unsetting LANG.


As-is your patch didn't work, because it has "systempath => []", which caused
libpq to not load, because it depended on things in the system path...


What's the reason for that?

After commenting that out, all but one test passed:

[20:21:31.137] - 8< 
-
[20:21:31.137] stderr:
[20:21:31.137] #   Failed test 'psql connect success'
[20:21:31.137] #   at 
C:/cirrus/src/test/recovery/t/041_checkpoint_at_promote.pl line 161.
[20:21:31.137] #  got: '2'
[20:21:31.137] # expected: '0'
[20:21:31.137] #   Failed test 'psql select 1'
[20:21:31.137] #   at 
C:/cirrus/src/test/recovery/t/041_checkpoint_at_promote.pl line 162.
[20:21:31.137] #  got: ''
[20:21:31.137] # expected: '1'
[20:21:31.137] # Looks like you failed 2 tests of 6.
[20:21:31.137]
[20:21:31.137] (test program exited with status code 2)
[20:21:31.137] 
--
[20:21:31.137]


Due to concurrency and run-to-run variance I wouldn't bet too much on this,
but the modified tests do have improved test times:

before:

[19:40:47.468] 135/296 postgresql:pg_amcheck / pg_amcheck/004_verify_heapam 
   OK7.70s   32 subtests passed
[19:43:40.853] 232/296 postgresql:amcheck / amcheck/001_verify_heapam   
   OK   36.50s   272 subtests passed

after:
[20:22:55.495] 133/296 postgresql:pg_amcheck / pg_amcheck/004_verify_heapam 
   OK4.60s   32 subtests passed
[20:25:13.641] 212/296 postgresql:amcheck / amcheck/001_verify_heapam   
   OK4.87s   272 subtests passed


I looked at a few past runs and there never were instances of
amcheck/001_verify_heapam that were even close to as fast as this.


The overall test time did improve some, but that is hard to weigh due to the
test failure.


> I agree with you that falling back on BackgroundPsql is not a terribly
> satisfactory solution.

I'm somewhat doubtful we'll just agree on making FFI::Platypus a hard
dependency, but if we agree to do so...


Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-15 Thread Andres Freund
Hi,

On 2024-06-15 10:45:16 -0400, Andrew Dunstan wrote:
> > 4. Pytest has autodiscovery of test files and functions, so we
> > probably wouldn't have to specify all of the exact test files anymore
> > in the meson.build files.
> 
> 
> I find this comment a bit ironic. We don't need to do that with the
> Makefiles, and the requirement to do so was promoted as a meson feature
> rather than a limitation, ISTR.

The reason it's good to have the list of tests somewhere explicit is that we
have different types of tests. With make, there is a single target for all tap
tests. If you want to run tests concurrently, make can only schedule the tap
tests at the granularity of a directory. If you want concurrency below that,
you need to use concurrency at the prove level. But that means that you have
extremely varying concurrency, depending on whether make runs targets that
have no internal concurrency or make runs e.g. the recovery tap tests.

I don't think we should rely on global test discovery via pytest. That'll lead
to uncontrollable concurrency again, which means much longer test times. We'll
always have different types of tests, and scheduling them via "top-level"
tools for the different test types just won't work well. That's not
true for many projects where tests have vastly lower resource usage.

Greetings,

Andres Freund




Re: libpq contention due to gss even when not using gss

2024-06-14 Thread Andres Freund
Hi,

On 2024-06-14 12:27:12 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Initializing the gss cache at all isn't so much the problem. It's that we do
> > it for every connection. And that doing so requires locking inside gss. So
> > maybe we could just globally cache that gss isn't available, instead of
> > rediscovering it over and over for every new connection.
> 
> I had the impression that krb5 already had such a cache internally.

Well, if so, it clearly doesn't seem to work very well, given that it causes
contention at ~15k lookups/sec. That's obviously a trivial number for anything
cached, even with the worst possible locking regimen.


> Maybe they don't cache the "failed" state though.  I doubt we'd
> want to either in long-lived processes --- what if the user
> installs the credential while we're running?

If we can come up with something better - cool. But it doesn't seem great that
gss introduces contention for the vast majority of folks that use libpq in
environments that never use gss.

I don't think we should cache the set of credentials on a process-wide basis
when gss is actually available, just the fact that gss isn't available at
all.  I think it's very unlikely for that fact to change while an application
is running.  And if it happens, requiring a restart in those cases seems an
acceptable price to pay for what is effectively a niche feature.

Greetings,

Andres Freund




Re: Using LibPq in TAP tests via FFI

2024-06-14 Thread Andres Freund
Hi,

On 2024-06-14 11:11:38 -0400, Andrew Dunstan wrote:
> On 2024-06-14 Fr 11:09, Andrew Dunstan wrote:
> > Over at [1] Andres expressed enthusiasm for enabling TAP tests to call
> > LibPQ directly via FFI, and there was some support from others as well.
> > Attached is a very rough POC for just that.There are two perl modules,
> > one which wraps libpq (or almost all of it) in perl, and another which
> > uses that module to create a session object that can be used to run SQL.

What are your current thoughts about a fallback for this?  It seems possible
to implement the session module on top of BackgroundPsql.pm, if necessary. But
I suspect we'll eventually get to a point where that gets less and less
convenient.


How much of a dependency is FFI::Platypus, compared to requiring perl headers
to be installed?  In case FFI::Platypus is a complicated dependency, a small XS
wrapper could be an alternative.

Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-14 Thread Andres Freund
On 2024-06-14 09:11:11 -0700, Andres Freund wrote:
> On 2024-06-14 11:49:29 -0400, Tom Lane wrote:
> > Am I right in guessing that pytest will have nothing to do with that?
> 
> Looks like there's a plugin for pytest to support tap as output:
> https://pypi.org/project/pytest-tap/
> 
> However, it's not available as a debian package. I know that some folks just
> advocate installing dependencies via venv, but I personally don't think
> that'll fly. For one, it'll basically prevent tests being run by packagers.

If this were the blocker, I think we could just ship an output adapter
ourselves. pytest-tap is not a lot of code:
https://github.com/python-tap/pytest-tap/blob/main/src/pytest_tap/plugin.py

So either vendoring it or just writing an even simpler version ourselves seems
entirely feasible.




Re: RFC: adding pytest as a supported test framework

2024-06-14 Thread Andres Freund
Hi,

On 2024-06-14 11:49:29 -0400, Tom Lane wrote:
> I also wonder about integration of python-based testing with what
> we already have.  A significant part of what you called the meson
> work had to do with persuading pg_regress, isolationtester, etc
> to output test results in the common format established by TAP.

FWIW, meson's testrunner doesn't require TAP, the lowest common denominator is
just an exit code. However, for things that run many "sub" tests, it's a lot
nicer if the failures can be reported more granularly than just "the entire
testsuite failed".

Meson currently supports:

exitcode: the executable's exit code is used by the test harness to record
the outcome of the test.

tap: Test Anything Protocol.

gtest (since 0.55.0): for Google Tests.

rust (since 0.56.0): for native rust tests


> Am I right in guessing that pytest will have nothing to do with that?

Looks like there's a plugin for pytest to support tap as output:
https://pypi.org/project/pytest-tap/

However, it's not available as a debian package. I know that some folks just
advocate installing dependencies via venv, but I personally don't think
that'll fly. For one, it'll basically prevent tests being run by packagers.


> Can we even manage to dump perl and python test scripts into the same
> subdirectory and sort things out automatically?

That shouldn't be much of a problem.


> I'm definitely going to be -1 for a python testing feature that cannot
> integrate with what we have because it demands its own control and
> result-collection infrastructure.

Dito.

Greetings,

Andres Freund




Re: libpq contention due to gss even when not using gss

2024-06-14 Thread Andres Freund
Hi,

On 2024-06-14 10:46:04 +0200, Dmitry Dolgov wrote:
> At the same time after quick look I don't see an easy way to avoid that.
> Current implementation tries to initialize gss before getting any
> confirmation from the server whether it's supported. Doing this other
> way around would probably just shift overhead to the server side.

Initializing the gss cache at all isn't so much the problem. It's that we do
it for every connection. And that doing so requires locking inside gss. So
maybe we could just globally cache that gss isn't available, instead of
rediscovering it over and over for every new connection.
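
To be clear, I mean something as dumb as the sketch below - all the names are
made up, not actual libpq functions, and a real version would also have to
think about thread safety:

/*
 * Sketch with hypothetical names: remember, process-wide, that no usable gss
 * credentials exist, so later connections skip the lock-heavy probe.
 */
static bool gss_known_unavailable = false;

static bool
have_gss_credentials(void)
{
    if (gss_known_unavailable)
        return false;

    if (!probe_default_gss_credentials())   /* the existing per-connection check */
    {
        gss_known_unavailable = true;       /* cache only the negative result */
        return false;
    }

    return true;                            /* don't cache success */
}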

Greetings,

Andres Freund




Re: Using LibPq in TAP tests via FFI

2024-06-14 Thread Andres Freund
Hi, 

On June 14, 2024 8:09:43 AM PDT, Andrew Dunstan  wrote:
>Over at [1] Andres expressed enthusiasm for enabling TAP tests to call LibPQ 
>directly via FFI, and there was some support from others as well. Attached is 
>a very rough POC for just that.There are two perl modules, one which wraps 
>libpq (or almost all of it) in perl, and another which uses that module to 
>create a session object that can be used to run SQL. Also in the patch is a 
>modification of one TAP test (arbitrarily chosen as 
>src/bin/pg_amcheck/t/004_verify_heapam.p) to use the new interface, so it 
>doesn't use psql at all.
>
>There's a bunch of work to do here, but for a morning's work it's not too bad 
>:-) Luckily I had most of the first file already to hand.

Yay!


>Next I plan to look at some of the recovery tests and other uses of 
>background_psql, which might be more challenging,a dn require extension of the 
>session API. Also there's a lot of error checking and documentation that need 
>to be added.

I'd suggest trying to convert the various looping constructs first, they're 
responsible for a large number of spawned shells. And I vaguely recall that 
there were none/very few that depend on actually being run via psql. 

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: CI and test improvements

2024-06-14 Thread Andres Freund
Hi, 

On June 14, 2024 8:22:01 AM PDT, Justin Pryzby  wrote:
>On Fri, Jun 14, 2024 at 05:36:54PM +0300, Nazir Bilal Yavuz wrote:
>> Hi,
>> 
>> On Thu, 13 Jun 2024 at 14:56, Justin Pryzby  wrote:
>> >
>> > ccache should be installed in the image rather than re-installed on each
>> > invocation.
>> 
>> ccache is installed in the Windows VM images now [1]. It can be used
>> as 'set CC=ccache.exe cl.exe' in the Windows CI task.
>> 
>> [1] 
>> https://github.com/anarazel/pg-vm-images/commit/03a9225ac962fb30b5c0722c702941e2d7c1e81e
>
>Thanks.  I think that works by using a "shim" created by choco in
>C:\ProgramData\chocolatey\bin.
>
>But going through that indirection seems to incur an extra 15sec of
>compilation time; I think we'll want to do something to avoid that.
>
>I'm not sure what the options are, like maybe installing ccache into a
>fixed path like c:\ccache or installing a custom link, to a "pinned"
>version of ccache.


Hm. There's actually already a mingw ccache installed. The images for mingw
and msvc used to be separate (for startup performance when using containers),
but we just merged them.  So it might be easiest to just explicitly use the
ccache from there (also an explicit path might be faster). Could you check if
that's fast? If not, we can just install the binaries distributed by the
project, it's just more work to keep up to date that way.

Andres 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: libpq contention due to gss even when not using gss

2024-06-13 Thread Andres Freund
Hi,

On 2024-06-13 17:33:57 +0200, Dmitry Dolgov wrote:
> > On Mon, Jun 10, 2024 at 11:12:12AM GMT, Andres Freund wrote:
> > Hi,
> >
> > To investigate a report of both postgres and pgbouncer having issues when a
> > lot of new connections aree established, I used pgbench -C.  Oddly, on an
> > early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
> > only when using TCP, not with unix sockets.
> >
> > c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 
> > host=127.0.0.1 user=test dbname=postgres password=fake'
> >
> > host=127.0.0.1:   16465
> > host=127.0.0.1,gssencmode=disable 20860
> > host=/tmp:49286
> >
> > Note that the server does *not* support gss, yet gss has a substantial
> > performance impact.
> >
> > Obviously the connection rates here absurdly high and outside of badly 
> > written
> > applications likely never practically relevant. However, the number of cores
> > in systems are going up, and this quite possibly will become relevant in 
> > more
> > realistic scenarios (lock contention kicks in earlier the more cores you
> > have).
> 
> By not supporting gss I assume you mean having built with --with-gssapi,
> but only host (not hostgssenc) records in pg_hba, right?

Yes, the latter. Or not having kerberos set up on the client side.

Greetings,

Andres Freund




Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-06-12 Thread Andres Freund
Hi,

On 2024-05-01 15:56:08 +, Amonson, Paul D wrote:
> Comparing the current SSE4.2 implementation of the CRC32C algorithm in
> Postgres, to an optimized AVX-512 algorithm [0] we observed significant
> gains. The result was a ~6.6X average multiplier of increased performance
> measured on 3 different Intel products. Details below. The AVX-512 algorithm
> in C is a port of the ISA-L library [1] assembler code.
>
> Workload call size distribution details (write heavy):
>* Average was approximately around 1,010 bytes per call
>* ~80% of the calls were under 256 bytes
>* ~20% of the calls were greater than or equal to 256 bytes up to the max 
> buffer size of 8192

This is extremely workload dependent; it's not hard to find workloads with
lots of very small records and very few big ones...  What you observed might
have "just" been the warmup behaviour, where more full page writes have to be
written.

There's a very frequent call computing COMP_CRC32C over just 20 bytes, while
holding a crucial lock.  If we were to introduce something like this
AVX-512 algorithm, it'd probably be worth dispatching differently for
compile-time-known small lengths.
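
Something along these lines is what I mean, i.e. let the compiler route short,
compile-time-known lengths to the plain hardware CRC32C path and only use the
wide-vector code for bigger buffers. This is just a sketch - the threshold is
arbitrary, and it assumes the SSE4.2 path is always usable whenever the
AVX-512 one could be chosen:

/* sketch, not the actual macro definition */
#define COMP_CRC32C(crc, data, len) \
    ((crc) = (__builtin_constant_p(len) && (len) < 128 ? \
              pg_comp_crc32c_sse42((crc), (data), (len)) : \
              pg_comp_crc32c((crc), (data), (len))))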


How does the latency of the AVX-512 algorithm compare to just using the CRC32C
instruction?


FWIW, I tried the v2 patch on my Xeon Gold 5215 workstation, and it dies early
on with SIGILL:

Program terminated with signal SIGILL, Illegal instruction.
#0  0x00d5946c in _mm512_clmulepi64_epi128 (__A=..., __B=..., __C=0)
at 
/home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/15/include/vpclmulqdqintrin.h:42
42return (__m512i) __builtin_ia32_vpclmulqdq_v8di ((__v8di)__A,
(gdb) bt
#0  0x00d5946c in _mm512_clmulepi64_epi128 (__A=..., __B=..., __C=0)
at 
/home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/15/include/vpclmulqdqintrin.h:42
#1  pg_comp_crc32c_avx512 (crc=, data=, 
length=)
at ../../../../../home/andres/src/postgresql/src/port/pg_crc32c_avx512.c:163
#2  0x00819343 in ReadControlFile () at 
../../../../../home/andres/src/postgresql/src/backend/access/transam/xlog.c:4375
#3  0x0081c4ac in LocalProcessControlFile (reset=) at 
../../../../../home/andres/src/postgresql/src/backend/access/transam/xlog.c:4817
#4  0x00a8131d in PostmasterMain (argc=argc@entry=85, 
argv=argv@entry=0x341b08f0)
at 
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:902
#5  0x009b53fe in main (argc=85, argv=0x341b08f0) at 
../../../../../home/andres/src/postgresql/src/backend/main/main.c:197


Cascade Lake doesn't have vpclmulqdq, so we shouldn't be getting here...

This is on an optimized build with meson, with -march=native included in
c_flags.

Relevant configure output:

Checking if "XSAVE intrinsics without -mxsave" : links: NO (cached)
Checking if "XSAVE intrinsics with -mxsave" : links: YES (cached)
Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO 
(cached)
Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES 
(cached)
Checking if "_mm512_clmulepi64_epi128 ... with -msse4.2 -mavx512vl 
-mvpclmulqdq" : links: YES
Checking if "x86_64: popcntq instruction" compiles: YES (cached)


Greetings,

Andres Freund




Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-06-12 Thread Andres Freund
Hi,

I wonder if this isn't going in the wrong direction. We're using CRCs for
something they're not well suited for, in my understanding - and are paying a
reasonably high price for it, given that even hardware accelerated CRCs aren't
blazingly fast.

CRCs are used for things like ethernet and iSCSI because they are good at
detecting the kinds of errors encountered there, namely short bursts of
bitflips. And the covered data is limited to a fairly small size.

Which imo makes CRCs a bad choice for WAL. For one, we don't actually expect a
short burst of bitflips; the most likely case is all bits after some point
changing (because only one part of the record made it to disk). For another,
WAL records are *not* limited to a small size, and if anything error detection
becomes more important with longer records (they're likely to span more
pages / segments).


It's hard to understand, but a nonetheless helpful page is
https://users.ece.cmu.edu/~koopman/crc/crc32.html which lists properties for
crc32c:
https://users.ece.cmu.edu/~koopman/crc/c32/0x8f6e37a0_len.txt
which lists
(0x8f6e37a0; 0x11edc6f41) <=> (0x82f63b78; 0x105ec76f1) 
{2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1} | gold | 
(*op) iSCSI; CRC-32C; CRC-32/4

This cryptic notation AFAIU indicates that for our polynomial we can detect
2 bit errors up to a length of 2147483615 bytes, 3 bit errors up to 2147483615,
3 and 4 bit errors up to 5243, 5 and 6 bit errors up to 177, and 7/8 bit errors
up to 47.

IMO for our purposes just about all errors are going to be at least at sector
boundaries, i.e. 512 bytes, and thus are at least 8 bits large. At that point
we are only guaranteed to find a single-byte error (it'll be common to have
much more) up to a length of 47 bits. Which isn't a useful guarantee.


With that I perhaps have established that CRC guarantees aren't useful for us.
But not yet why we should use something else: Given that we already aren't
relying on hard guarantees, we could instead just use a fast hash like xxh3.
https://github.com/Cyan4973/xxHash which is fast both for large and small
amounts of data.
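
For illustration, the amount of code needed is trivial (a sketch; truncating
to 32 bits is purely my assumption, to keep the current width of the checksum
field):

#include <stddef.h>
#include <stdint.h>
#include "xxhash.h"             /* https://github.com/Cyan4973/xxHash */

/* sketch: XXH3_64bits() is fast for both short and long inputs */
static inline uint32_t
record_hash32(const void *data, size_t len)
{
    return (uint32_t) XXH3_64bits(data, len);
}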


Greetings,

Andres Freund




Re: On disable_cost

2024-06-12 Thread Andres Freund
Hi,

On 2024-06-12 14:33:31 -0400, Robert Haas wrote:
> On Wed, Jun 12, 2024 at 2:11 PM Andres Freund  wrote:
> > 
> >
> > In an extreme case i can see a tiny bit of overhead, but not enough to be
> > worth worrying about. Mostly because we're so profligate in doing
> > bms_overlap() that cost comparisons don't end up mattering as much - I seem 
> > to
> > recall that being different in the not distant past though.
> 
> There are very few things I love more than when you can't resist
> trying to break my patches and yet fail to find a problem. Granted the
> latter part only happens once a century or so, but I'll take it.

:)


Too high cost in path cost comparison is what made me look at the PG code for
the first time, IIRC :)



> > Aside: I'm somewhat confused by add_paths_to_joinrel()'s handling of
> > mergejoins_allowed. If mergejoins are disabled we end up reaching
> > match_unsorted_outer() in more cases than with mergejoins enabled. E.g. we
> > only set mergejoin_enabled for right joins inside 
> > select_mergejoin_clauses(),
> > but we don't call select_mergejoin_clauses() if !enable_mergejoin and 
> > jointype
> > != FULL.  I, what?
> 
> I agree this logic is extremely confusing, but "we only set
> mergejoin_enabled for right joins inside select_mergejoin_clauses()"
> doesn't seem to be true.

Sorry, should have been more precise. With "set" I didn't mean set to true,
but that it's only modified within select_mergejoin_clauses().


> It starts out true, and always stays true except for right, right-anti, and
> full joins, where select_mergejoin_clauses() can set it to false. Since the
> call to match_unsorted_outer() is gated by mergejoin_enabled, you might
> think that we'd skip considering nested loops on the strength of not being
> able to do a merge join, but comment "2." in add_paths_to_joinrel explains
> that the join types for which mergejoin_enabled can end up false aren't
> supported by nested loops anyway. Still, this logic is really tortured.

Agree that that's the logic - but doesn't that mean we'll consider nestloops
for e.g. right joins iff enable_mergejoin=false?


Greetings,

Andres Freund




Re: On disable_cost

2024-06-12 Thread Andres Freund
Hi,

On 2024-06-12 11:35:48 -0400, Robert Haas wrote:
> Subject: [PATCH v2 3/4] Treat the # of disabled nodes in a path as a separate
>  cost metric.
> 
> Previously, when a path type was disabled by e.g. enable_seqscan=false,
> we either avoided generating that path type in the first place, or
> more commonly, we added a large constant, called disable_cost, to the
> estimated startup cost of that path. This latter approach can distort
> planning. For instance, an extremely expensive non-disabled path
> could seem to be worse than a disabled path, especially if the full
> cost of that path node need not be paid (e.g. due to a Limit).
> Or, as in the regression test whose expected output changes with this
> commit, the addition of disable_cost can make two paths that would
> normally be distinguishible cost seem to have fuzzily the same cost.
> 
> To fix that, we now count the number of disabled path nodes and
> consider that a high-order component of both the cost. Hence, the
> path list is now sorted by disabled_nodes and then by total_cost,
> instead of just by the latter, and likewise for the partial path list.
> It is important that this number is a count and not simply a Boolean;
> else, as soon as we're unable to respect disabled path types in all
> portions of the path, we stop trying to avoid them where we can.


>   if (criterion == STARTUP_COST)
>   {
>   if (path1->startup_cost < path2->startup_cost)
> @@ -118,6 +127,15 @@ compare_fractional_path_costs(Path *path1, Path *path2,
>   Costcost1,
>   cost2;
>  
> + /* Number of disabled nodes, if different, trumps all else. */
> + if (unlikely(path1->disabled_nodes != path2->disabled_nodes))
> + {
> + if (path1->disabled_nodes < path2->disabled_nodes)
> + return -1;
> + else
> + return +1;
> + }

I suspect it's going to be ok, because the branch is going to be very
predictable in normal workloads, but I still worry a bit about making
compare_path_costs_fuzzily() more expensive. For more join-heavy queries it
can really show up, and there are plenty of ORM-generated join-heavy query
workloads.

If costs were 32 bit integers, I'd have suggested just stashing the disabled
counts in the upper 32 bits of a 64bit integer. But ...




In an extreme case I can see a tiny bit of overhead, but not enough to be
worth worrying about. Mostly because we're so profligate in doing
bms_overlap() that cost comparisons don't end up mattering as much - I seem to
recall that being different in the not too distant past though.


Aside: I'm somewhat confused by add_paths_to_joinrel()'s handling of
mergejoins_allowed. If mergejoins are disabled we end up reaching
match_unsorted_outer() in more cases than with mergejoins enabled. E.g. we
only set mergejoin_enabled for right joins inside select_mergejoin_clauses(),
but we don't call select_mergejoin_clauses() if !enable_mergejoin and jointype
!= FULL.  I, what?


Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Andres Freund
Hi,

On 2024-06-11 07:28:23 -0700, Jacob Champion wrote:
> On Mon, Jun 10, 2024 at 1:04 PM Andres Freund  wrote:
> > Just for context for the rest the email: I think we desperately need to move
> > off perl for tests. The infrastructure around our testing is basically
> > unmaintained and just about nobody that started doing dev stuff in the last 
> > 10
> > years learned perl.
> 
> Okay. Personally, I'm going to try to stay out of discussions around
> subtracting Perl and focus on adding Python, for a bunch of different
> reasons:

I think I might have formulated my paragraph above badly - I didn't mean that
we should move away from perl tests tomorrow, but that we need a path forward
that allows folks to write tests without perl.


> - Tests aren't cheap, but in my experience, the maintenance-cost math
> for tests is a lot different than the math for implementations.

At the moment they often tend to be *more* expensive, due to spurious
failures. That's mostly not perl's fault, don't get me wrong, but due to us
not having better infrastructure for testing complicated behaviour and/or
testing things on a more narrow basis.


> > > Problem 1 (rerun failing tests): One architectural roadblock to this
> > > in our Test::More suite is that tests depend on setup that's done by
> > > previous tests. pytest allows you to declare each test's setup
> > > requirements via pytest fixtures, letting the test runner build up the
> > > world exactly as it needs to be for a single isolated test. These
> > > fixtures may be given a "scope" so that multiple tests may share the
> > > same setup for performance or other reasons.
> >
> > OTOH, that's quite likely to increase overall test times very
> > significantly. Yes, sometimes that can be avoided with careful use of 
> > various
> > features, but often that's hard, and IME is rarely done rigiorously.
> 
> Well, scopes are pretty front and center when you start building
> pytest fixtures, and the complicated longer setups will hopefully
> converge correctly early on and be reused everywhere else. I imagine
> no one wants to build cluster setup from scratch.

One (the?) prime source of state in our tap tests is the
database. Realistically we can't just tear that one down and reset it between
tests without causing the test times to explode. So we'll have to live with
some persistent state.


> On a slight tangent, is this not a problem today?

It is, but that doesn't mean making it even bigger is unproblematic :)




> > I think part of the problem is that the information about what precisely
> > failed is often much harder to collect when testing multiple servers
> > interacting than when doing localized unit tests.
> >
> > I think we ought to invest a bunch in improving that, I'd hope that a lot of
> > that work would be largely independent of the language the tests are written
> > in.
> 
> We do a lot more acceptance testing than internal testing, which came
> up as a major complaint from me and others during the unconference.
> One of the reasons people avoid writing internal tests in Perl is
> because it's very painful to find a rhythm with Test::More.

What definition of internal tests are you using here?

I think a lot of our tests are complicated, fragile and slow because we almost
exclusively do end-to-end tests, because with a few exceptions we don't have a
way to exercise code in a more granular way.


> > > When it comes to third-party packages, which I think we're
> > > probably going to want in moderation, we would still need to discuss
> > > supply chain safety. Python is not as mature here as, say, Go.
> >
> > What external dependencies are you imagining?
> 
> The OAuth pytest suite makes extensive use of
> - psycopg, to easily drive libpq;

That's probably not going to fly. It introduces painful circular dependencies
between building postgres (for libpq), building psycopg (requiring libpq) and
testing postgres (requiring psycopg).

You *can* solve such issues, but we've debated that in the past, and I doubt
we'll find agreement on the awkwardness it introduces.


> - construct, for on-the-wire packet representations and manipulation; and

That seems fairly minimal.


> - pyca/cryptography, for easy generation of certificates and manual
> crypto testing.

That's a bit more painful, but I guess maybe not unrealistic?


> I'd imagine each would need considerable discussion, if there is
> interest in doing the same things that I do with them.

One thing worth thinking about is that such dependencies have to work on a
relatively large number of platforms / architectures. A lot of projects
don't...

Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Andres Freund
Hi,

On 2024-06-11 08:04:57 -0400, Andrew Dunstan wrote:
> Some time ago I did some work on wrapping libpq using the perl FFI module.
> It worked pretty well, and would mean we could probably avoid many uses of
> IPC::Run, and would probably be substantially more efficient (no fork
> required). It wouldn't avoid all uses of IPC::Run, though.

FWIW, I'd *love* to see work on this continue. The reduction in test runtime
on windows is substantial and would make the hack->CI->fail->hack loop a
good bit shorter. And save money.


> But my point was mainly that while a new framework might have value, I don't
> think we need to run out and immediately rewrite several hundred TAP tests.

Oh, yea. That's not at all feasible to just do in one go.

Greetings,

Andres Freund




Re: Proposal: Document ABI Compatibility

2024-06-12 Thread Andres Freund
Hi,

On 2024-06-12 14:58:04 +0200, Jelte Fennema-Nio wrote:
> On Wed, 12 Jun 2024 at 14:44, Peter Eisentraut  wrote:
> > I think since around 6 years ago we have been much more vigilant about
> > avoiding ABI breaks.  So if there aren't any more recent examples of
> > breakage, then maybe that was ultimately successful, and the upshot is,
> > continue to be vigilant at about the same level?
> 
> While not strictly an ABI break I guess, the backport of 32d5a4974c81
> broke building Citus against 13.10 and 14.7[1].

I think that kind of thing is not something we (PG devs) really can do
anything about. It's also
a) fairly easy thing to fix
b) fails during compilation
c) doesn't break ABI afaict

Greetings,

Andres Freund




Re: Proposal: Document ABI Compatibility

2024-06-12 Thread Andres Freund
On 2024-06-11 10:55:38 -0400, David E. Wheeler wrote:
> ABI Policy
> ==
> 
> The PostgreSQL core team maintains two application binary interface (ABI) 
> guarantees: one for major releases and one for minor releases.

I.e. for major versions it's "there is none"?

> Major Releases
> --
> 
> Applications that use the PostgreSQL APIs must be compiled for each major 
> release supported by the application. The inclusion of `PG_MODULE_MAGIC` 
> ensures that code compiled for one major version will rejected by other major 
> versions.
> 
> Furthermore, new releases may make API changes that require code changes. Use 
> the `PG_VERSION_NUM` constant to adjust code in a backwards compatible way:
> 
> ``` c
> #if PG_VERSION_NUM >= 16
> #include "varatt.h"
> #endif
> ```
> 
> PostgreSQL avoids unnecessary API changes in major releases, but usually
> ships a few necessary API changes, including deprecation, renaming, and
> argument variation.


> In such cases the incompatible changes will be listed in the Release Notes.

I don't think we actually exhaustively list all of them.


> Minor Releases
> --
> 
> PostgreSQL makes every effort to avoid both API and ABI breaks in minor 
> releases. In general, an application compiled against any minor release will 
> work with any other minor release, past or future.

s/every/a reasonable/ or just s/every/an/


> When a change *is* required, PostgreSQL will choose the least invasive way
> possible, for example by squeezing a new field into padding space or
> appending it to the end of a struct. This sort of change should not impact
> dependent applications unless they use `sizeof(the struct)` or create their
> own instances of such structs --- patterns best avoided.

The padding case doesn't affect sizeof() fwiw.

I think there's too often not an alternative to using sizeof(), potentially
indirectly (via makeNode() or such). So this sounds a bit too general.
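
To make the distinction concrete (a generic example, not an actual Postgres
struct): squeezing a field into existing padding leaves offsets and sizeof()
alone, while appending a field keeps offsets but grows sizeof(), which is
what bites code that allocates such structs itself:

#include <stdio.h>

typedef struct Before
{
    int     flags;              /* 4 bytes, then 4 bytes padding on LP64 */
    void   *ptr;
} Before;                       /* sizeof() == 16 on typical 64-bit ABIs */

typedef struct PaddingCase      /* new field squeezed into the padding */
{
    int     flags;
    int     new_field;          /* reuses the former padding: sizeof() unchanged */
    void   *ptr;
} PaddingCase;

typedef struct AppendCase       /* new field appended at the end */
{
    int     flags;
    void   *ptr;
    int     new_field;          /* existing offsets unchanged, but sizeof() grows */
} AppendCase;

int
main(void)
{
    printf("%zu %zu %zu\n",
           sizeof(Before), sizeof(PaddingCase), sizeof(AppendCase));
    return 0;
}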

Greetings,

Andres Freund




Re: Postgresql OOM

2024-06-11 Thread Andres Freund
Hi,

On 2024-06-07 21:42:58 +0300, Radu Radutiu wrote:
> On Fri, Jun 7, 2024 at 7:59 PM Andres Freund  wrote:
> > On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> > > I have a query that forces an out of memory error, where the OS will kill
> > > the postgresql process.
> >
> > FWIW, it can be useful to configure the OS with strict memory overcommit.
> > That
> > causes postgres to fail more gracefully, because the OOM killer won't be
> > invoked.
> >
> >
> > > The query plan (run immediately after a vacuum analyze) is at
> > > https://explain.depesz.com/s/ITQI#html .
> >
> > Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the
> > number of
> > workers? It'd be useful to get some of the information about the actual
> > numbers of tuples etc.

> I've tried first giving more memory to the OS and mounting a tmpfs
> in  pgsql_tmp. It didn't  work, I got
> ERROR:  invalid DSA memory alloc request size 1_140_850_688
> CONTEXT:  parallel worker
> I've seen around 2 million temporary files created before the crash.
> With work_mem 100MB I was not able to get it to work with 2 parallel
> workers.
> Next, I've increased work_mem to 200MB and now (with extra memory and
> tmpfs) it finished: https://explain.depesz.com/s/NnRC

That's helpful, thanks!

One thing to note is that postgres' work_mem is confusing - it applies to
individual query execution nodes, not the whole query. Additionally, when you
use parallel workers, each of the parallel workers can use the "full"
work_mem, rather than work_mem being evenly distributed across workers.

Within that, the memory usage in the EXPLAIN ANALYZE isn't entirely unexpected
(I'd say it's unreasonable if you're not a postgres dev, but that's a
different issue).

We can see each of the Hash nodes use ~1GB, which is due to
(1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.

In this specific query we probably could free the memory in the "lower" hash
join nodes once the node directly above has finished building, but we don't
have the logic for that today.


Of course that doesn't explain why the memory usage / temp file creation is so
extreme with a lower limit / fewer workers.  There aren't any bad statistics
afaics, nor can I really see a potential for a significant skew; it looks to
me that the hashtables are all built on a single text field and that nearly
all the input rows are distinct.


Could you post the table definition (\d+) and the database definition
(\l)? One random guess I have is that you ended up with a "non-deterministic"
collation for that column and that we end up with a bad hash due to that.



Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-10 16:46:56 -0400, Andrew Dunstan wrote:
> 
> On 2024-06-10 Mo 16:04, Andres Freund wrote:
> > Hi,
> > 
> > 
> > Just for context for the rest of the email: I think we desperately need to move
> > off perl for tests. The infrastructure around our testing is basically
> > unmaintained and just about nobody that started doing dev stuff in the last 
> > 10
> > years learned perl.

> Andres,
> 
> I get that you don't like perl.

I indeed don't particularly like perl - but that's really not the main
issue. I've already learned [some of] it. The main issue is that I've
also watched several newer folks try to write tests in it, and it was not
pretty.


> But it's hard for me to take this terribly seriously. "desperately" seems
> like massive overstatement at best.

Shrug.


> As for what up and coming developers learn, they mostly don't learn C
> either, and that's far more critical to what we do.

C is a lot more useful to them than perl. And it's actually far more
widely known these days than perl. C does teach you some reasonably
low-level-ish understanding of hardware. There are gazillions of programs
written in C that we'll have to maintain for decades. I don't think that's
comparably true for perl.


> I'm not sure what part of the testing infrastructure you think is
> unmaintained. For example, the last release of Test::Simple was all the way
> back on April 25.

IPC::Run is quite buggy and basically just maintained by Noah these days.

Greetings,

Andres Freund




Re: RFC: adding pytest as a supported test framework

2024-06-10 Thread Andres Freund
Hi,


Just for context for the rest of the email: I think we desperately need to move
off perl for tests. The infrastructure around our testing is basically
unmaintained and just about nobody that started doing dev stuff in the last 10
years learned perl.


On 2024-06-10 11:46:00 -0700, Jacob Champion wrote:
> 4. It'd be great to split apart client-side tests from server-side
> tests. Driving Postgres via psql all the time is fine for acceptance
> testing, but it becomes a big problem when you need to test how
> clients talk to servers with incompatible feature sets, or how a peer
> behaves when talking to something buggy.

That seems orthogonal to using pytest vs something else?


> == Why pytest? ==
> 
> From the small and biased sample at the unconference session, it looks
> like a number of people have independently settled on pytest in their
> own projects. In my opinion, pytest occupies a nice space where it
> solves some of the above problems for us, and it gives us plenty of
> tools to solve the other problems without too much pain.

We might be able to alleviate that by simply abstracting it away, but I found
pytest's testrunner pretty painful. Oodles of options that are not very well
documented and that often don't work because they are very specific to some
situations, without that being explained.


> Problem 1 (rerun failing tests): One architectural roadblock to this
> in our Test::More suite is that tests depend on setup that's done by
> previous tests. pytest allows you to declare each test's setup
> requirements via pytest fixtures, letting the test runner build up the
> world exactly as it needs to be for a single isolated test. These
> fixtures may be given a "scope" so that multiple tests may share the
> same setup for performance or other reasons.

OTOH, that's quite likely to increase overall test times very
significantly. Yes, sometimes that can be avoided with careful use of various
features, but often that's hard, and IME is rarely done rigorously.


> Problem 2 (seeing what failed): pytest does this via assertion
> introspection and very detailed failure reporting. If you haven't seen
> this before, take a look at the pytest homepage [1]; there's an
> example of a full log.

That's not really different than what the perl tap test stuff allows. We
indeed are bad at utilizing it, but I'm not sure that switching languages will
change that.

I think part of the problem is that the information about what precisely
failed is often much harder to collect when testing multiple servers
interacting than when doing localized unit tests.

I think we ought to invest a bunch in improving that; I'd hope that a lot of
that work would be largely independent of the language the tests are written
in.


> Python's standard library has lots of power by itself, with very good
> documentation. And virtualenvs and better package tooling have made it
> much easier, IMO, to avoid the XKCD dependency tangle [4] of the
> 2010s.

Ugh, I think this is actually python's weakest area. There are about a dozen
package managers and "python distributions" that are at best half compatible,
and the documentation situation around this is *awful*.


> When it comes to third-party packages, which I think we're
> probably going to want in moderation, we would still need to discuss
> supply chain safety. Python is not as mature here as, say, Go.

What external dependencies are you imagining?



> == A Plan ==
> 
> Even if everyone were on board immediately, there's a lot of work to
> do. I'd like to add pytest in a more probationary status, so we can
> iron out the inevitable wrinkles. My proposal would be:
> 
> 1. Commit bare-bones support in our Meson setup for running pytest, so
> everyone can kick the tires independently.
> 2. Add a test for something that we can't currently exercise.
> 3. Port a test from a place where the maintenance is terrible, to see
> if we can improve it.
> 
> If we hate it by that point, no harm done; tear it back out. Otherwise
> we keep rolling forward.

I think somewhere between 1 and 4 a *substantial* amount of work would be
required to provide a bunch of the infrastructure that Cluster.pm etc
provide. Otherwise we'll end up with a lot of copy pasted code between tests.

Greetings,

Andres Freund




Re: Proposal: Document ABI Compatibility

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-10 15:05:32 -0400, David E. Wheeler wrote:
> > An API break in PostgreSQL 10.4 and 9.6.9 makes it impossible
> > to use these versions: the "extract_actual_join_clauses" function
> > gained an additional parameter.
> 
> The 10.4 commit is 68fab04, and it does indeed add a new function:

That's 6 years ago, not sure we can really learn that much from that.

And it's not like it's actually impossible, #ifdefs aren't great, but they are
better than nothing.
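
For illustration, a sketch of what that can look like in an extension (the
variable names here are made up, and the version cutoffs are just the minor
releases in question, 9.6.9 = 90609 and 10.4 = 100004):

#if PG_VERSION_NUM >= 100004 || \
	(PG_VERSION_NUM >= 90609 && PG_VERSION_NUM < 100000)
	/* these minor releases added the joinrelids parameter */
	extract_actual_join_clauses(restrictinfo_list, joinrelids,
								&joinquals, &otherquals);
#else
	extract_actual_join_clauses(restrictinfo_list,
								&joinquals, &otherquals);
#endif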


> ``` patch
> --- a/src/include/optimizer/restrictinfo.h
> +++ b/src/include/optimizer/restrictinfo.h
> @@ -36,6 +36,7 @@ extern List *get_actual_clauses(List *restrictinfo_list);
>  extern List *extract_actual_clauses(List *restrictinfo_list,
>bool pseudoconstant);
>  extern void extract_actual_join_clauses(List *restrictinfo_list,
> +   Relids joinrelids,
> List **joinquals,
> List **otherquals);
>  extern bool join_clause_is_movable_to(RestrictInfo *rinfo, RelOptInfo 
> *baserel);
> ```
> 
> I wonder if that sort of change could be avoided in backpatches, maybe by 
> adding and using a `extract_actual_join_clauses_compat` function and using 
> that internally instead?
> 
> Or, to David C’s point, perhaps it would be better to say there are some 
> categories of APIs that are not subject to any guarantees in minor releases?

I'm honestly very dubious that this is a good point to introduce a bunch of
formalism. It's already a lot of work to maintain them; if we make it even
harder we'll end up with more fixes not being backported, because it's not worth
the pain.

To be blunt, the number of examples raised here doesn't seem to indicate that
this is an area where we need to invest additional resources. We are already
severely constrained as a project by committer bandwidth, and there are plenty
of other things that seem more important to focus on.

Greetings,

Andres Freund




Re: Remove dependence on integer wrapping

2024-06-10 Thread Andres Freund
Hi,

On 2024-06-09 21:57:54 -0400, Tom Lane wrote:
> BTW, while I approve of trying to get rid of our need for -fwrapv,
> I'm quite scared of actually doing it.

I think that's a quite fair concern. One potentially relevant datapoint is
that we actually don't have an -fwrapv equivalent on all platforms, and I don't
recall a lot of complaints from windows users. Of course it's quite possible
that they'd never notice...

I think this is a good argument for enabling -ftrapv in development
builds. That gives us at least a *chance* of seeing these issues.
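
To make the difference concrete, a trivial sketch:

int
increment(int x)
{
	/*
	 * For x == INT_MAX this overflows: undefined behaviour by default,
	 * well-defined wrapping with -fwrapv, and a runtime trap with -ftrapv.
	 */
	return x + 1;
}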


> Whatever cases you may have discovered by running the regression tests will
> be at best the tip of the iceberg.  Is there any chance of using static
> analysis to find all the places of concern?

The last time I tried removing -fwrapv both gcc and clang found quite a few
issues. I think I fixed most of those though, so presumably the issues that
remain are ones less easily found by simple static analysis.

I wonder if coverity would find more if we built without -fwrapv.


GCC has -Wstrict-overflow=n, which at least tells us where the compiler
optimizes based on the assumption that there cannot be overflow.  It does
generate a bunch of noise, but I think it's almost exclusively due to our
MemSet() macro.

Greetings,

Andres Freund




libpq contention due to gss even when not using gss

2024-06-10 Thread Andres Freund
Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C.  Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 
host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number of cores
in systems are going up, and this quite possibly will become relevant in more
realistic scenarios (lock contention kicks in earlier the more cores you
have).

And it doesn't seem great that something as rarely used as gss introduces
overhead to very common paths.

Here's a bottom-up profile:

-   32.10%  pgbench  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 32.09% queued_spin_lock_slowpath
  - 16.15% futex_wake
   do_futex
   __x64_sys_futex
   do_syscall_64
 - entry_SYSCALL_64_after_hwframe
- 16.15% __GI___lll_lock_wake
   - __GI___pthread_mutex_unlock_usercnt
  - 5.12% gssint_select_mech_type
 - 4.36% gss_inquire_attrs_for_mech
- 2.85% gss_indicate_mechs
   - gss_indicate_mechs_by_attrs
  - 1.58% gss_acquire_cred_from
   gss_acquire_cred
   pg_GSS_have_cred_cache
   select_next_encryption_method (inlined)
   init_allowed_encryption_methods (inlined)
   PQconnectPoll
   pqConnectDBStart (inlined)
   PQconnectStartParams
   PQconnectdbParams
   doConnect




Clearly the contention originates outside of our code, but is triggered by
doing pg_GSS_have_cred_cache() every time a connection is established.

Greetings,

Andres Freund




Re: Postgresql OOM

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-06 13:58:24 +0100, Pantelis Theodosiou wrote:
> I am not qualified to answer on the OOM issue but why are you joining the
> same table (outputrequest) 4 times (using an identical join condition)?

The conditions aren't actually the same
rpl_rec_tro.  input_sequence = r.input_sequence
rpl_snd_tro.reply_input_sequence = r.input_sequence
snd_tro.reply_input_sequence = t.input_sequence

The first two join r.input_sequence to different columns; the third one also uses
reply_input_sequence but joins to t, not r.

Greetings,

Andres Freund




Re: Postgresql OOM

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-06 15:25:25 +0300, Radu Radutiu wrote:
> I have a query that forces an out of memory error, where the OS will kill
> the postgresql process.

FWIW, it can be useful to configure the OS with strict memory overcommit. That
causes postgres to fail more gracefully, because the OOM killer won't be
invoked.


> The query plan (run immediately after a vacuum analyze) is at
> https://explain.depesz.com/s/ITQI#html .

Can you get EXPLAIN (ANALYZE, BUFFERS) to complete if you reduce the number of
workers? It'd be useful to get some of the information about the actual
numbers of tuples etc.


Greetings,

Andres Freund




Re: XACT_EVENT for 'commit prepared'

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-07 11:19:40 -0400, Tom Lane wrote:
> Xiaoran Wang  writes:
> > I found that in enum XactEvent, there is  'XACT_EVENT_PREPARE'  for
> > 'prepare transaction', but there is no event for 'commit prepared' or
> > 'rollback prepared'.
> 
> On the whole, it seems like a good idea to me that those commands
> don't invoke event triggers.  It is a core principle of 2PC that
> if 'prepare' succeeded, 'commit prepared' must not fail.  Invoking a
> trigger during the second step would add failure cases and I'm not
> sure what value it has.

Event triggers? Isn't this about RegisterXactCallback?

XACT_EVENT_COMMIT is called after the commit record has been flushed and the
procarray has been modified. Thus a failure in the hook has somewhat limited
consequences. I'd assume XACT_EVENT_COMMIT_PREPARED would do something
similar.

I suspect the reason we don't call back for 2pc commit/rollback prepared is
simple: The code for doing a 2pc commit prepared lives in twophase.c, not
xact.c...
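
For reference, roughly how code hooks into these events today (a minimal
sketch; XACT_EVENT_COMMIT_PREPARED is hypothetical, the other names exist in
access/xact.h):

#include "postgres.h"
#include "fmgr.h"
#include "access/xact.h"

PG_MODULE_MAGIC;

static void
my_xact_callback(XactEvent event, void *arg)
{
	switch (event)
	{
		case XACT_EVENT_COMMIT:
			/* commit record is already flushed at this point */
			break;
			/* a hypothetical XACT_EVENT_COMMIT_PREPARED would be handled here */
		default:
			break;
	}
}

void
_PG_init(void)
{
	RegisterXactCallback(my_xact_callback, NULL);
}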

Greetings,

Andres Freund




Re: PgStat_KindInfo.named_on_disk not required in shared stats

2024-06-07 Thread Andres Freund
Hi,

On 2024-06-07 14:07:33 +0900, Michael Paquier wrote:
> While hacking on the area of pgstat_*.c, I have noticed the existence
> of named_on_disk in PgStat_KindInfo, that is here to track the fact
> that replication slots are a particular case in the PgStat_HashKey for
> the dshash table of the stats because this kind of stats requires a
> mapping between the replication slot name and the hash key.
> 
> As far as I can see, this field is not required and is used nowhere,
> because the code relies on the existence of the to_serialized_name and
> from_serialized_name callbacks to do the mapping.
> 
> Wouldn't it make sense to remove it?  This field is defined since
> 5891c7a8ed8f that introduced the shmem stats, and has never been used
> since.

Yes, makes sense. Looks like we changed direction during development a bunch of
times...


> This frees an extra bit in PgStat_KindInfo, which is going to help me
> a bit with what I'm doing with this area of the code while keeping the
> structure size the same.

Note it's just a single bit, not a full byte. So unless you need precisely 30
bits, rather than 29, I don't really see why it'd help? And I don't see a
reason to strictly keep the structure size the same.

Greetings,

Andres Freund




Re: relfilenode statistics

2024-06-06 Thread Andres Freund
Hi,

On 2024-06-03 11:11:46 +, Bertrand Drouvot wrote:
> The main argument is that we currently don’t have writes counters for 
> relations.
> The reason is that we don’t have the relation OID when writing buffers out.
> Tracking writes per relfilenode would allow us to track/consolidate writes per
> relation (example in the v1 patch and in the message up-thread).
> 
> I think that adding instrumentation in this area (writes counters) could be
> beneficial (like it is for the ones we currently have for reads).
> 
> Second argument is that this is also beneficial for the "Split index and
> table statistics into different types of stats" thread (mentioned in the 
> previous
> message). It would allow us to avoid additional branches in some situations 
> (like
> the one mentioned by Andres in the link I provided up-thread).

I think there's another *very* significant benefit:

Right now physical replication doesn't populate statistics fields like
n_dead_tup, which can be a huge issue after failovers, because there's little
information about what autovacuum needs to do.

Auto-analyze can *partially* fix it at times, if it's lucky enough to see
enough dead tuples - but that's not a given, and even if it works, it is often
wildly inaccurate.


Once we put things like n_dead_tup into per-relfilenode stats, we can populate
them during WAL replay. Thus after a promotion autovacuum has much better
data.


This also is important when we crash: We've been talking about storing a
snapshot of the stats alongside each REDO pointer. Combined with updating
stats during crash recovery, we'll have accurate dead-tuple stats once recovery
has finished.


Greetings,

Andres Freund




Re: relfilenode statistics

2024-06-06 Thread Andres Freund
Hi,

On 2024-06-06 12:27:49 -0400, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 1:52 AM Bertrand Drouvot
>  wrote:
> > I think we should keep the stats in the relation during relfilenode changes.
> > As a POC, v1 implemented a way to do so during TRUNCATE (see the changes in
> > table_relation_set_new_filelocator() and in pg_statio_all_tables): as you 
> > can
> > see in the example provided up-thread the new heap_blks_written statistic 
> > has
> > been preserved during the TRUNCATE.
>
> Yeah, I think there's something weird about this design. Somehow we're
> ending up with both per-relation and per-relfilenode counters:
>
> +   pg_stat_get_blocks_written(C.oid) +
> pg_stat_get_relfilenode_blocks_written(d.oid, CASE WHEN
> C.reltablespace <> 0 THEN C.reltablespace ELSE d.dattablespace END,
> C.relfilenode) AS heap_blks_written,
>
> I'll defer to Andres if he thinks that's awesome, but to me it does
> not seem right to track some blocks written in a per-relation counter
> and others in a per-relfilenode counter.

It doesn't immediately sound awesome. Nor really necessary?

If we just want to keep prior stats upon a relation rewrite, we can just copy
the stats from the old relfilenode.  Or we can decide that those stats don't
really make sense anymore, and start from scratch.


I *guess* I could see an occasional benefit in having both counters for "prior
relfilenodes" and "current relfilenode" - except that stats get reset manually
and upon crash anyway, making this less useful than if it were really
"lifetime" stats.

Greetings,

Andres Freund




Re: [multithreading] extension compatibility

2024-06-05 Thread Andres Freund
Hi,

On 2024-06-05 21:59:42 -0400, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 9:50 PM Andres Freund  wrote:
> > Depending on the architecture / ABI / compiler options it's often not
> > meaningfully more expensive to access a thread local variable than a 
> > "normal"
> > variable.
> >
> > I think these days it's e.g. more expensive on x86-64 windows, but not
> > linux. On arm the overhead of TLS is more noticeable, across platforms,
> > afaict.
> 
> I mean, to me, this still sounds like we want multithreading to be a
> build-time option.

Maybe. I think shipping a mode where users can fairly simply toggle between
threaded and process mode will allow us to get this stable a *lot* quicker
than if we distribute two builds. Most users don't build from source; distros
will have to pick the mode. If they don't choose threaded mode, we'll not find
problems. If they do choose threaded mode, we can't ask users to switch to a
process based mode to check if the problem is related.

We have been talking in a bunch of threads about having a mode where the main
postgres binary chooses a build optimized for the current architecture, to be
able to use SIMD instructions without a runtime check/dispatch. I guess we
could add threadedness to such a matrix...

Greetings,

Andres Freund




Re: [multithreading] extension compatibility

2024-06-05 Thread Andres Freund
Hi,

On 2024-06-05 21:10:01 -0400, Robert Haas wrote:
> On Wed, Jun 5, 2024 at 8:01 PM Heikki Linnakangas  wrote:
> > I'm very much in favor of a runtime toggle. To be precise, a
> > PGC_POSTMASTER setting. We'll get a lot more testing if you can easily
> > turn it on/off, and so far I haven't seen anything that would require it
> > to be a compile time option.
> 
> I was thinking about global variable annotations. If someone wants to
> build without multithreading, I think that they won't want to still
> end up with a ton of variables being changed to thread-local.

Depending on the architecture / ABI / compiler options it's often not
meaningfully more expensive to access a thread local variable than a "normal"
variable.

I think these days it's e.g. more expensive on x86-64 windows, but not
linux. On arm the overhead of TLS is more noticeable, across platforms,
afaict.

Example compiler output for x86-64 and armv8:
https://godbolt.org/z/K369eG5MM

Cycle analysis of the linux x86-64 output:
https://godbolt.org/z/KK57vM1of

This shows that for the linux x86-64 case there's no difference in efficiency
between the tls/non-tls case.

The reason it's so fast on x86-64 linux is that one of the "old" segment
registers is reused to serve as an index register that differs between each
thread.  For x86-64, most code is compiled position-independent and *also*
uses an indexed addressing mode (but relative to the instruction pointer).
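
As a tiny illustration (a sketch, not taken from the godbolt links above):

int				plain_counter;
__thread int	tls_counter;

void
bump(void)
{
	/*
	 * With local-exec TLS on x86-64 linux both of these compile to a single
	 * add instruction; the TLS one just uses an %fs-relative address.
	 */
	plain_counter++;
	tls_counter++;
}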


I think we might be able to gain some small performance benefits via the
annotations, which actually might make it viable to just apply the annotations
regardless of whether threads are used or not.


Greetings,

Andres Freund




Re: Use the same Windows image on both VS and MinGW tasks

2024-06-03 Thread Andres Freund
Hi,

On 2023-08-29 15:18:29 +0300, Nazir Bilal Yavuz wrote:
> The VS and MinGW Windows images are merged on Andres' pg-vm-images
> repository now [1]. So, the old pg-ci-windows-ci-vs-2019 and
> pg-ci-windows-ci-mingw64 images will not be updated from now on. This
> new merged image (pg-ci-windows-ci) needs to be used on both VS and
> MinGW tasks.
> I attached a patch for using pg-ci-windows-ci Windows image on VS and
> MinGW tasks.

Thanks!  Pushed to 15, 16 and master.

Greetings,

Andres Freund




Re: allow changing autovacuum_max_workers without restarting

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-03 14:28:13 -0500, Nathan Bossart wrote:
> On Mon, Jun 03, 2024 at 12:08:52PM -0700, Andres Freund wrote:
> > Why do we think that increasing the number of PGPROC slots, heavyweight 
> > locks
> > etc by 256 isn't going to cause issues?  That's not an insubstantial amount 
> > of
> > memory to dedicate to something that will practically never be used.
> 
> I personally have not observed problems with these kinds of bumps in
> resource usage, although I may be biased towards larger systems where it
> doesn't matter as much.

IME it matters *more* on larger systems. Or at least used to, I haven't
experimented with this in quite a while.

It's possible that we improved a bunch of things sufficiently for this to not
matter anymore.

Greetings,

Andres Freund




Re: Fix an incorrect assertion condition in mdwritev().

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-04 07:17:51 +0900, Michael Paquier wrote:
> On Sun, May 26, 2024 at 12:08:46AM -0400, Tom Lane wrote:
> > After a few minutes' thought, how about:
> > 
> > Assert((uint64) blocknum + (uint64) nblocks <= (uint64) mdnblocks(reln, 
> > forknum));
> > 
> > This'd stop being helpful if we ever widen BlockNumber to 64 bits,
> > but I think that's unlikely.  (Partitioning seems like a better answer
> > for giant tables.)
> 
> No idea if this will happen or not, but that's not the only area where
> we are going to need a native uint128 implementation to control the
> overflows with uint64.

I'm confused - isn't using common/int.h entirely sufficient for that? Nearly
all architectures have more efficient ways to check for 64bit overflows than
doing actual 128 bit math.
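
E.g. something along these lines (a sketch using the overflow helpers from
common/int.h; the function itself is made up):

#include "postgres.h"
#include "common/int.h"

static uint64
add_block_counts(uint64 a, uint64 b)
{
	uint64		result;

	/* pg_add_u64_overflow() returns true if a + b doesn't fit into 64 bits */
	if (pg_add_u64_overflow(a, b, &result))
		elog(ERROR, "block count overflows 64 bits");

	return result;
}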

Greetings,

Andres Freund




Re: Proposal: Document ABI Compatibility

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-03 15:21:04 -0400, David E. Wheeler wrote:
> > Extensions in general can do lots of stuff, guaranteeing that bug fixes 
> > don't
> > cause any problems is just not feasible.
> >
> > It'd be interesting to see a few examples of actual minor-version-upgrade
> > extension breakages, so we can judge what caused them.
>
> In the community Slack[4], Matthias van de Meent writes[5]:
>
> > Citus’ pg_version_compat.h[7] where it re-implements a low-level function 
> > that was newly introduced in PG14.7. If you build against PG14.7+ headers, 
> > you may get random crashes when running on 14.6.

I don't see how this would trigger random crashes.

Unfortunately [4] doesn't seem to take me to a relevant message (pruned chat
history?), so I can't infer more from that context.


> I suppose it would work fine on 14.7 if compiled on 14.6 though. I suspect
> there aren’t many examples, though, mostly just a lot of anxiety, and some
> have decided that extensions must be recompiled for every minor release in
> order to avoid the issue. StackGres[7] is one example, but I suspect Omni
> (Yurii’s company) may follow.

Regardless of ABI issues, it's probably a good idea to continually run tests
against in-development minor versions, just to prevent something breaking from
creeping in. IIRC there were a handful of cases where we accidentally broke
some extension, because they relied on some implementation details.


> >> Unless, that is, we can provide a complete list of things not to do (like
> >> make use of padding) to avoid it. Is that feasible?
> >
> > You can't really rely on the contents of padding, in general. So I don't 
> > think
> > this is really something that needs to be called out.
>
> Sure, probably not a problem, but if that’s the sole qualifier for making
> binary changes, I think it’s worth saying, as opposed to “we don’t make
> any”. Something like “Only changes to padding, which you never used anyway,
> right?” :-)

IDK, to me something like this seems to promise more than we actually can deliver.


Greetings,

Andres Freund




Re: allow changing autovacuum_max_workers without restarting

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-03 13:52:29 -0500, Nathan Bossart wrote:
> Here is an updated patch that uses 256 as the upper limit.

I don't have time to read through the entire thread right now - it'd be good
for the commit message of a patch like this to include justification for why
it's ok to make such a change. Even before actually committing it, so
reviewers have an easier time catching up.

Why do we think that increasing the number of PGPROC slots, heavyweight locks
etc by 256 isn't going to cause issues?  That's not an insubstantial amount of
memory to dedicate to something that will practically never be used.

ISTM that at the very least we ought to exclude the reserved slots from the
computation of things like the number of locks resulting from
max_locks_per_transaction.  It's very common to increase
max_locks_per_transaction substantially; adding ~250 to the multiplier can be a
a good amount of memory. And AV workers should never need a meaningful number.

Increasing e.g. the size of the heavyweight lock table has consequences
besides the increase in memory usage, the size increase can make it less
likely for the table to fit largely into L3, thus decreasing performance.

Greetings,

Andres Freund




Re: Proposal: Document ABI Compatibility

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-03 14:43:17 -0400, David E. Wheeler wrote:
> At the PGConf Unconference session on improving extension support in core,
> we talked quite a bit about the recent anxiety among extension developers
> about a lack of an ABI compatibility guarantee in Postgres.

Are there notes for the session?


> Yurii Rashkovskii did a little header file spelunking and talked[1] about a
> few changes in minor version releases, including to apparent field order in
> structs.

It'd be nice if the slides for the talk could be uploaded...


> > You must be referring to my commit 714780dc. The new field is stored within 
> > alignment padding (though only on back branches). Has this been tied to a 
> > known problem?
> 
> At the Unconference, Tom Lane said that this approach is pretty well drilled
> into the heads of every committer, and new ones pick it up through
> experience. The goal, IIUC, is to never introduce binary incompatibilities
> into the C APIs in minor releases. This standard would be good to document,
> to let extension developers know exactly what the guarantees are.

I don't think we can really make this a hard guarantee. Yes, we try hard to
avoid ABI breaks, but there IIRC have been a few cases over the years where
that wasn't practical for some reason. If we have to decide between a bad bug
and causing an ABI issue that's unlikely to affect anybody, we'll probably
choose the ABI issue.


> * The ABI is guaranteed to change only in backward compatible ways in minor
> releases. If for some reason it doesn’t it’s a bug that will need to be
> fixed.

Thus I am not really on board with this statement as-is.

Extensions in general can do lots of stuff, guaranteeing that bug fixes don't
cause any problems is just not feasible.

It'd be interesting to see a few examples of actual minor-version-upgrade
extension breakages, so we can judge what caused them.


> But if I understand correctly, the use of techniques like adding a new field
> in padding does not mean that extensions compiled on later minor releases
> will work on earlier minor releases of the same major version.

I don't think it's common for such new-fields-in-padding to cause problems
when using an earlier minor PG version. For that the extension would need to
actually rely on the presence of the new field, but typically that'd not be
the case when we introduce a new field in a minor version.


> Unless, that is, we can provide a complete list of things not to do (like
> make use of padding) to avoid it. Is that feasible?

You can't really rely on the contents of padding, in general. So I don't think
this is really something that needs to be called out.

Greetings,

Andres Freund




Re: Build with LTO / -flto on macOS

2024-06-03 Thread Andres Freund
Hi,

On 2024-06-03 17:07:22 +0200, Wolfgang Walther wrote:
> Peter Eisentraut:
> > It's probably worth clarifying that this option is needed on macOS only
> > if LTO is also enabled.  For standard (non-LTO) builds, the
> > export-dynamic behavior is already the default on macOS (otherwise
> > nothing in PostgreSQL would work).
> 
> Right, man page say this:
> 
> > Preserves all global symbols in main executables during LTO.  Without this
> option, Link Time Optimization is allowed to inline and remove global
> functions. This option is used when a main executable may load a plug-in
> which requires certain symbols from the main executable.

Gah. Apple's tendency to just break stuff that has worked across *nix-y
platforms for decades is pretty annoying. They could just have made
--export-dynamic an alias for --export_dynamic, but no, everyone needs a
special macos thingy in their build scripts.


> Peter:
> > I don't think we explicitly offer LTO builds as part of the make build
> > system, so anyone trying this would do it sort of self-service, by
> > passing additional options to configure or make.  In which case they
> > might as well pass the -export_dynamic option along in the same way?
> 
> The challenge is that it defeats the purpose of LTO to pass this along to
> everything, e.g. via CFLAGS. The Makefiles set this in LDFLAGS_EX_BE only,
> so it only affects the backend binary. This is not at all obvious and took
> me quite a while to figure out why LTO silently didn't strip symbols from
> other binaries. It does work to explicitly set LDFLAGS_EX_BE, though.
> 
> Also, passing the LTO flag on Linux "just works" (clang, not GCC
> necessarily).

It should just work on gcc, or at least has in the recent past.


ISTM if we want to test for -export_dynamic like what you proposed, we should
do so only if --export-dynamic wasn't found. No need to incur the overhead on
!macos.

Greetings,

Andres Freund




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2024-06-02 Thread Andres Freund
Hi,

At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What led to that?

Greetings,

Andres Freund




Re: meson "experimental"?

2024-05-31 Thread Andres Freund
Hi, 

On May 30, 2024 8:03:33 AM PDT, Andrew Dunstan  wrote:
>On Thu, May 30, 2024 at 6:32 AM Aleksander Alekseev <
>aleksan...@timescale.com> wrote:
>
>>
>>
>> By a quick look on the buildfarm we seem to use Ninja >= 1.11.1.
>> However since Meson can use both Ninja and VS as a backend I'm not
>> certain which section would be most appropriate for naming the minimal
>> required version of Ninja.
>>
>>
>When I tried building with the VS backend it blew up, I don't recall the
>details. I think we should just use ninja everywhere. That keeps things
>simple. 

VS should work, and if not, we should fix it. It's slow, so I'd not use it for 
scheduled builds, but it's fine for people developing using Visual Studio. 

Andres 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: meson vs windows perl

2024-05-28 Thread Andres Freund
Hi,

On 2024-04-05 16:12:12 -0400, Andrew Dunstan wrote:
> OK, this has been fixed and checked. The attached is what I propose.

The perl command is pretty hard to read. What about using python's shlex
module instead? Rough draft attached.  Still not very pretty, but seems easier
to read?

It'd be even better if we could just get perl to print out the flags in an
easier to parse way, but I couldn't immediately see a way.

Greetings,

Andres Freund
diff --git i/meson.build w/meson.build
index d6401fb8e30..191a051defb 100644
--- i/meson.build
+++ w/meson.build
@@ -997,13 +997,20 @@ if not perlopt.disabled()
 undesired = run_command(perl_conf_cmd, 'ccdlflags', check: true).stdout().split()
 undesired += run_command(perl_conf_cmd, 'ldflags', check: true).stdout().split()
 
+ldopts_split = run_command(python, '-c', '''
+import shlex
+import sys
+
+print('\n'.join(shlex.split(sys.argv[1])), end='')
+''',
+ ldopts, check: true).stdout().split('\n')
+
 perl_ldopts = []
-foreach ldopt : ldopts.split(' ')
-  if ldopt == '' or ldopt in undesired
+foreach ldopt : ldopts_split
+  if ldopt in undesired
 continue
   endif
-
-  perl_ldopts += ldopt.strip('"')
+  perl_ldopts += ldopt
 endforeach
 
 message('LDFLAGS recommended by perl: "@0@"'.format(ldopts))


Re: First draft of PG 17 release notes

2024-05-24 Thread Andres Freund
Hi,

On 2024-05-22 18:33:03 -0400, Bruce Momjian wrote:
> On Tue, May 21, 2024 at 09:40:28AM -0700, Andres Freund wrote:
> > On 2024-05-18 11:13:54 -0400, Bruce Momjian wrote:
> > I agree keeping things reasonably short is important. But I don't think 
> > you're
> > evenly applying it as a goal.
> >
> > Just skimming the notes from the end, I see
> > - an 8 entries long pg_stat_statements section
>
> What item did you want to remove?  Those are all user-visible changes.

My point here was not that we necessarily need to remove those, but that their
impact on users is smaller than many of the performance impacts you disregard.


> > - multiple entries about "Create custom wait events for ..."
>
> Well, those are all in different sections, so how can they be merged,
> unless I create a "wait event section", I guess.

They're not, all are in "Additional Modules". Instead of

- Create custom wait events for postgres_fdw (Masahiro Ikeda)
- Create custom wait events for dblink (Masahiro Ikeda)
- Allow extensions to define custom wait events (Masahiro Ikeda)

I'd make it:

- Allow extensions to define custom wait events and create custom wait events
  for postgres_fdw, dblink (Masahiro Ikeda)


> > - three entries about adding --all to {reindexdb,vacuumdb,clusterdb}.
>
> The problem with merging these is that the "Specifically, --all can now
> be used with" is different for all three of them.

You said you were worried about the length of the release notes, because it
discourages users from actually reading the release notes, due to getting
bored. Having three instances of almost the same entry, with just minor changes
between them, seems like precisely the kind of thing that risks boring readers.

I'd probably just go for

- Add --all option to clusterdb, reindexdb, vacuumdb to process objects in all
  databases matching a pattern (Nathan Bossart)

or such. The precise details of how the option works for the different
commands doesn't need to be stated in the release notes, that's more of a
reference documentation thing. But if you want to include it, we can do
something like

  Specifically, --all can now be used with --table (all commands), --schema
  (reindexdb, vacuumdb), and --exclude-schema (reindexdb, vacuumdb).


> > - an entry about adding long options to pg_archivecleanup
>
> Well, that is a user-visible change.  Should it not be listed?

If you are concerned about the length of the release notes and as a
consequence not including more impactful performance changes, then no, it
shouldn't. It doesn't break anyone's current scripts, and it doesn't enable
anything new.


> > - two entries about grantable maintenance rights, once via pg_maintain, once
> >   per-table
>
> Well, one is a GRANT and another is a role, so merging them seemed like
> it would be too confusing.

I don't think it has to be.

Maybe something roughly like

- Allow granting the right to perform maintenance operations (Nathan Bossart)

  The permission can be granted on a per-table basis using the MAINTAIN
  privilege and on a system wide basis via the pg_maintain role.

  Operations that can be controlled are VACUUM, ANALYZE, REINDEX, REFRESH
  MATERIALIZED VIEW, CLUSTER, and LOCK TABLE.


I'm again mostly reacting to your concern that the release notes are getting
too boring to read. Repeated content, like in the current formulation, imo
makes that worse. Currently it is:

- Add per-table GRANT permission MAINTAIN to control maintenance operations 
(Nathan Bossart)

  The operations are VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW, 
CLUSTER, and LOCK TABLE.

- Add user-grantable role pg_maintain to control maintenance operations (Nathan 
Bossart)

  The operations are VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW, 
CLUSTER, and LOCK TABLE.



> > - separate entries about pg_stat_reset_slru(), pg_stat_reset_shared("slru"),
>
> They are different functions with different detail text.

So what? You can change their text. Making it three entries makes it harder
for a reader that doesn't care about resetting stats to skip over the details.

Make it something like

- Improve control over resetting statistics (Atsushi Torikoshi, Bharath
  Rupireddy)

  pg_stat_reset_shared() can now reset all shared statistics, by passing NULL;
  pg_stat_reset_shared(NULL) also resets SLRU statistics;
  pg_stat_reset_shared("slru") resets SLRU statistics, which was already
  possible using pg_stat_reset_slru(NULL).

Greetings,

Andres Freund




Re: First draft of PG 17 release notes

2024-05-24 Thread Andres Freund
Hi,

On 2024-05-23 23:27:04 -0400, Bruce Momjian wrote:
> On Thu, May 23, 2024 at 11:11:10PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > I am not sure Bruce that you realize that your disregard for
> > performance improvements is shared by nobody.  Arguably,
> > performance is 90% of what we do these days, and it's also
> > 90% of what users care about.
>
> Please stop saying I don't document performance.  I have already
> explained enough which performance items I choose.  Please address my
> criteria or suggest new criteria.

Bruce, just about everyone seems to disagree with your current approach. And
not just this year; this has been a discussion in most, if not all, release note
threads of the last few years.

People, including me, *have* addressed your criteria, but you just waved those
concerns away. It's hard to continue discussing criteria when it doesn't at
all feel like a conversation.

In the end, these are patches to the source code; I don't think you can just
wave away widespread disagreement with your changes. That's not how we do
postgres development.

Greetings,

Andres Freund




Re: Upgrade Debian CI images to Bookworm

2024-05-24 Thread Andres Freund
Hi,

On 2024-05-24 16:17:37 +0200, Peter Eisentraut wrote:
> I'm not sure what the backpatching expectations of this kind of thing is.
> The history of this CI setup is relatively short, so this hasn't been
> stressed too much.  I see that we once backpatched the macOS update, but
> that might have been all.

I've backpatched a few other changes too.


> If we start backpatching this kind of thing, then this will grow as a job
> over time.  We'll have 5 or 6 branches to keep up to date, with several
> operating systems.  And once in a while we'll have to make additional
> changes like this warning fix you mention here.  I'm not sure how much we
> want to take this on.  Is there ongoing value in the CI setup in
> backbranches?

I find it extremely useful to run CI on backbranches before
backpatching. Enough so that I've thought about proposing backpatching CI all
the way.

I don't think it's that much work to fix this kind of thing in the
backbranches. We don't need to backpatch new tasks or such. Just enough stuff
to keep e.g. the base image the same - otherwise we end up running CI on
unsupported distros, which doesn't help anybody.


> With these patches, we could do either of the following:
> 5) We update master, PG16, and PG15, but we hold all of them until the
> warning in PG15 is fixed.

I think we should apply the fix in <= 15 - IMO it's a correct compiler
warning; what we do right now is wrong.

Greetings,

Andres Freund




Re: Convert node test compile-time settings into run-time parameters

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-20 09:28:39 +0200, Peter Eisentraut wrote:
> - Performance?  Looking for example at pg_parse_query() and its siblings,
> they also check for other debugging settings like log_parser_stats in the
> main code path, so it doesn't seem to be a concern.

I don't think we can conclude that. Just because we've not been that careful
about performance in a few spots doesn't mean we shouldn't be careful in other
areas. And I think something like log_parser_stats is a lot more generally
useful than debug_copy_parse_plan_trees.

The branch itself isn't necessarily the issue; the branch predictor can handle
that to a good degree. The reduction in code density is a bigger concern - and
also very hard to measure, because the cost is very incremental and
distributed.

At the very least I'd add unlikely() to all of the branches, so the debug code
can be placed separately from the "normal" portions.
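
Roughly like this (a sketch; the GUC and the helper function are just
placeholders for whatever the patch actually adds):

static void
maybe_verify_copy(Query *querytree)
{
	/* hypothetical debug GUC; unlikely() keeps the cold path out of line */
	if (unlikely(debug_copy_parse_plan_trees))
		verify_copied_parse_tree(querytree);	/* hypothetical helper */
}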


Where I'd be more concerned about performance is the added branch in
READ_LOCATION_FIELD. There are a lot of calls to that; adding runtime
branches to each, with external function calls inside, is somewhat likely to
be measurable.


> - Access control?  I have these settings as PGC_USERSET for now. Maybe they
> should be PGC_SUSET?

That probably would be right.


> Another thought:  Do we really need three separate settings?

Maybe not three settings, but a single setting, with multiple values, like
debug_io_direct?


Greetings,

Andres Freund




Re: zlib detection in Meson on Windows broken?

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-20 11:58:05 +0100, Dave Page wrote:
> I then attempt to build PostgreSQL:
> 
>  meson setup build
> -Dextra_include_dirs=C:/build64/openssl/include,C:/build64/zlib/include
> -Dextra_lib_dirs=C:/build64/openssl/lib,C:/build64/zlib/lib -Dssl=openssl
> -Dzlib=enabled --prefix=c:/build64/pgsql
> 
> Which results in the output in output.txt, indicating that OpenSSL was
> correctly found, but zlib was not. I've also attached the meson log.

I forgot to mention that earlier: Assuming you're building something to be
distributed, I'd recommend --auto-features=enabled/disabled and specifying
specifically which dependencies you want to be used.

Greetings,

Andres Freund




Re: First draft of PG 17 release notes

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-21 09:27:20 -0700, Andres Freund wrote:
> Also, the release notes are also not just important to users. I often go back
> and look in the release notes to see when some some important change was made,
> because sometimes it's harder to find it in the git log, due to sheer
> volume. And even just keeping up with all the changes between two releases is
> hard, it's useful to be able to read the release notes and see what happened.
>
> [...]
>
> [1] I've wondered if we should have one more level of TOC on the release note
> page, so it's easier to jump to specific sections.

Which reminds me: Eventually I'd like to add links to the most important
commits related to release note entries. We already do much of the work of
building that list of commits for each entry. That'd allow a reader to find
more details if interested.

Right now one either has to open the sgml file (which no user will know to do), or
find the git entries manually, the latter of which is often hard, because the
git commits often will use different wording etc.

Admittedly doing so within the constraints of docbook and not wanting to
overly decrease density (both in .sgml and the resulting html) isn't a trivial
task.


That's entirely independent of my concern around noting performance
improvements in the release notes, of course.

Greetings,

Andres Freund




Re: First draft of PG 17 release notes

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-18 11:13:54 -0400, Bruce Momjian wrote:
> Please see the email I just posted.  There are three goals we have to
> adjust for:
> 
> 1.  short release notes so they are readable
> 2.  giving people credit for performance improvements
> 3.  showing people Postgres cares about performance
> 
> I would like to achieve 2 & 3 without harming #1.  My experience is if I
> am reading a long document, and I get to a section where I start to
> wonder, "Why should I care about this?", I start to skim the rest of
> the document.

I agree keeping things reasonably short is important. But I don't think you're
evenly applying it as a goal.

Just skimming the notes from the end, I see
- an 8 entries long pg_stat_statements section
- multiple entries about "Create custom wait events for ..."
- three entries about adding --all to {reindexdb,vacuumdb,clusterdb}.
- an entry about adding long options to pg_archivecleanup
- two entries about grantable maintenance rights, once via pg_maintain, once
  per-table
- separate entries about pg_stat_reset_slru(), pg_stat_reset_shared("slru"),

If you're concerned about brevity, we can make things shorter without skipping
over most performance improvements.


> I am particularly critical if I start to wonder, "Why
> does the author _think_ I should care about this?" becasue it feels like
> the author is writing for him/herself and not the audience.

FWIW, I think it's a good thing for somebody other than the author to have a
hand in writing a release notes entry for a change. The primary author(s) are
often too deep into some issue to have a good view of the right level of
detail and understandability.

Greetings,

Andres Freund




Re: First draft of PG 17 release notes

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-18 10:59:47 -0400, Bruce Momjian wrote:
> On Wed, May 15, 2024 at 08:48:02PM -0700, Andres Freund wrote:
> > +many.
> >
> > We're having this debate every release. I think the ongoing reticence to 
> > note
> > performance improvements in the release notes is hurting Postgres.
> >
> > For one, performance improvements are one of the prime reason users
> > upgrade. Without them being noted anywhere more dense than the commit log,
> > it's very hard to figure out what improved for users. A halfway widely
> > applicable performance improvement is far more impactful than many of the
> > feature changes we do list in the release notes.
>
> I agree the impact of performance improvements are often greater than
> the average release note item.  However, if people expect Postgres to be
> faster, is it important for them to know _why_ it is faster?

Yes, it very often is. Performance improvements typically aren't "make
everything 3% faster", they're more "make this special thing 20%
faster". Without know what got faster, users don't know if
a) the upgrade will improve their production situation
b) they need to change something to take advantage of the improvement


> On the flip side, a performance improvement that makes everything 10%
> faster has little behavioral change for users, and in fact I think we
> get ~6% faster in every major release.

I cannot recall many "make everything faster" improvements, if any.

And even if it's "make everything faster" - that's useful for users to know,
it might solve their production problem!  It's also good for PR.


Given how expensive postgres upgrades still are, we can't expect production
workloads to upgrade to every major version. The set of performance
improvements and feature additions between major versions can help users make
an informed decision.


Also, the release notes are not just important to users. I often go back
and look in the release notes to see when some important change was made,
because sometimes it's harder to find it in the git log, due to sheer
volume. And even just keeping up with all the changes between two releases is
hard, it's useful to be able to read the release notes and see what happened.


> > For another, it's also very frustrating for developers that focus on
> > performance. The reticence to note their work, while noting other, far
> > smaller, things in the release notes, pretty much tells us that our work 
> > isn't
> > valued.
>
> Yes, but are we willing to add text that every user will have to read
> just for this purpose?

Of course it's a tradeoff. We shouldn't verbosely note down every small
change just because of the recognition; that'd make the release notes
unreadable. And it'd just duplicate the commit log. But that's not the same as
defaulting to not noting performance improvements, even if the performance
improvement is more impactful than many other features that are noted.


> One think we _could_ do is to create a generic performance release note
> item saying performance has been improved in the following areas, with
> no more details, but we can list the authors on the item.

To me that's the "General Performance" section. If somebody reading the
release notes doesn't care about performance, they can just skip that section
([1]).  I don't see why we wouldn't want to include the same level of detail
as for other changes.

Greetings,

Andres Freund

[1] I've wondered if we should have one more level of TOC on the release note
page, so it's easier to jump to specific sections.




Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)

2024-05-21 Thread Andres Freund
On 2024-05-17 16:03:09 -0400, Peter Geoghegan wrote:
> On Fri, May 17, 2024 at 3:50 PM Andres Freund  wrote:
> > You're saying that the data is correctly accessible on primaries, but broken
> > on standbys? Is there any difference in how the page looks like on the 
> > primary
> > vs standby?
> 
> There clearly is. The relevant infomask bits are different. I didn't
> examine it much closer than that, though.

That could also just be because of a different replay position, hence my
question about that somewhere else in the email...




Re: zlib detection in Meson on Windows broken?

2024-05-21 Thread Andres Freund
Hi,

On 2024-05-20 11:58:05 +0100, Dave Page wrote:
> I have very little experience with Meson, and even less interpreting it's
> logs, but it seems to me that it's not including the extra lib and include
> directories when it runs the test compile, given the command line it's
> reporting:
> 
> cl C:\Users\dpage\git\postgresql\build\meson-private\tmpg_h4xcue\testfile.c
> /nologo /showIncludes /utf-8 /EP /nologo /showIncludes /utf-8 /EP /Od /Oi-
> 
> Bug, or am I doing something silly?

It's a buglet. We rely on meson's internal fallback detection of zlib, if it's
not provided via pkg-config or cmake. But it doesn't know about our
extra_include_dirs parameter. We should probably fix that...

Greetings,

Andres Freund




Re: Speed up clean meson builds by ~25%

2024-05-20 Thread Andres Freund
On 2024-05-20 09:37:46 -0400, Robert Haas wrote:
> On Sat, May 18, 2024 at 6:09 PM Andres Freund  wrote:
> > A few tests with ccache disabled:
> 
> These tests seem to show no difference between the two releases, so I
> wonder what you're intending to demonstrate here.

They show a few seconds of win for 'real' time.
-O0: 0m21.577s->0m19.529s
-O3: 0m59.730s->0m54.853s




Re: broken tables on hot standby after migration on PostgreSQL 16 (3x times last month)

2024-05-17 Thread Andres Freund
Hi,

On 2024-05-17 15:12:31 +0200, Pavel Stehule wrote:
> after migration on PostgreSQL 16 I seen 3x times (about every week) broken
> tables on replica nodes. The query fails with error

Migrating from what version?


You're saying that the data is correctly accessible on primaries, but broken
on standbys? Is there any difference in how the page looks like on the primary
vs standby?


> ERROR:  could not access status of transaction 1442871302
> DETAIL:  Could not open file "pg_xact/0560": No such file or directory
>
> verify_heapam reports
>
> prd=# select * from verify_heapam('account_login_history') where blkno = 179036;
>  blkno  | offnum | attnum |                                msg
> --------+--------+--------+--------------------------------------------------------------------
>  179036 |     30 |        | xmin 1393743382 precedes oldest valid transaction ID 3:1687012112

So that's not just a narrow race...


> master
>
> (2024-05-17 14:36:57) prd=# SELECT * FROM page_header(get_raw_page('account_login_history', 179036));
>       lsn      │ checksum │ flags │ lower │ upper │ special │ pagesize │ version │ prune_xid
> ───────────────┼──────────┼───────┼───────┼───────┼─────────┼──────────┼─────────┼───────────
>  A576/810F4CE0 │        0 │     4 │   296 │   296 │    8192 │     8192 │       4 │         0
> (1 row)
>
>
> replica
> prd_aukro=# SELECT * FROM page_header(get_raw_page('account_login_history', 179036));
>       lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> ---------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  A56C/63979DA0 |        0 |     0 |   296 |   296 |    8192 |     8192 |       4 |         0
> (1 row)

Is the replica behind the primary? Or did we somehow end up with diverging
data? The page LSNs differ by about 40GB...

Is there evidence of failed truncations of the relation in the log? From
autovacuum?

Does the data in the readable versions of the tuples on that page actually
look valid? Is it possibly duplicated data?


I'm basically wondering whether it's possible that we errored out during
truncation (e.g. due to a file permission issue or such). Due to some
brokenness in RelationTruncate() that can lead to data divergence between
primary and standby and to old tuples re-appearing on either.


Another question: Do you use pg_repack or such?

Greetings,

Andres Freund




Re: problems with "Shared Memory and Semaphores" section of docs

2024-05-17 Thread Andres Freund
Hi,

On 2024-05-17 18:30:08 +, Imseih (AWS), Sami wrote:
> > The advantage of the GUC is that its value could be seen before trying to
> > actually start the server.
>
> Only if they have a sample in postgresql.conf file, right?
> A GUC like shared_memory_size_in_huge_pages will not be.

You can query gucs with -C. E.g.

postgres -D pgdev-dev -c shared_buffers=16MB -C shared_memory_size_in_huge_pages
13
postgres -D pgdev-dev -c shared_buffers=16MB -c huge_page_size=1GB -C shared_memory_size_in_huge_pages
1

Which is very useful for actually configuring that number of huge pages. I
don't think a system view or such would help here.

Greetings,

Andres Freund




Lowering the minimum value for maintenance_work_mem

2024-05-16 Thread Andres Freund
Hi,

In the subthread at [1] I needed to trigger multiple rounds of index vacuuming
within one vacuum.

It turns out that, with the new dead tuple implementation, that has actually become
somewhat expensive. Particularly if all tuples on all pages get deleted, the
representation is just "too dense". Normally that's obviously very good, but
for testing, not so much:

With the minimum setting of maintenance_work_mem=1024kB, on a simple table with
narrow rows where all rows are deleted, the first cleanup only happens after
3697812 dead TIDs. The table has to be > ~128MB for that.

Needing a ~128MB table to be able to test multiple cleanup passes makes it
much more expensive to test and consequently will lead to worse test coverage.

I think we should consider lowering the minimum setting of
maintenance_work_mem to the minimum of work_mem. For real-life workloads
maintenance_work_mem=1024kB is going to already be quite bad, so we don't
protect users much by forbidding a setting lower than 1MB.


Just for comparison, with a limit of 1MB, versions < 17 needed to do the first
cleanup pass after 174472 dead tuples. That's roughly a 20x improvement. Really nice.
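
(Scratch arithmetic behind those numbers - assuming the pre-17 dead-TID array
stored one 6-byte ItemPointerData per dead tuple, which is why ~174k of them
fill 1MB; purely illustrative, not postgres code:)

#include <stdio.h>

int
main(void)
{
	/*
	 * pre-17: ~6 bytes per dead TID, so roughly this many fit in 1MB
	 * (close to the observed 174472; the remainder is overhead)
	 */
	printf("old array, 1MB: ~%.0f TIDs\n", 1024.0 * 1024 / 6);

	/* ratio between the two observed thresholds quoted above */
	printf("ratio: ~%.1fx\n", 3697812.0 / 174472.0);
	return 0;
}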

Greetings,

Andres Freund

[1] https://postgr.es/m/20240516193953.zdj545efq6vabymd%40awork3.anarazel.de




gist index builds try to open FSM over and over

2024-05-16 Thread Andres Freund
f18) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:1706
#18 0x00847996 in table_index_build_scan (table_rel=0x7f5b87979688, index_rel=0x7f5b87977f38, index_info=0x2fd9f50, allow_sync=true, progress=true, callback=0x848b6b , callback_state=0x7ffd3ce87340, scan=0x0) at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1794
#19 0x00847da8 in gistbuild (heap=0x7f5b87979688, index=0x7f5b87977f38, indexInfo=0x2fd9f50) at ../../../../../home/andres/src/postgresql/src/backend/access/gist/gistbuild.c:313
#20 0x0094c945 in index_build (heapRelation=0x7f5b87979688, indexRelation=0x7f5b87977f38, indexInfo=0x2fd9f50, isreindex=false, parallel=true) at ../../../../../home/andres/src/postgresql/src/backend/catalog/index.c:3021
#21 0x0094970f in index_create (heapRelation=0x7f5b87979688, indexRelationName=0x2f2f798 "foo_i_idx1", indexRelationId=17747, parentIndexRelid=0, parentConstraintId=0, relFileNumber=0, indexInfo=0x2fd9f50, indexColNames=0x2f2fc60, accessMethodId=783, tableSpaceId=0, collationIds=0x2f32340, opclassIds=0x2f32358, opclassOptions=0x2f32370, coloptions=0x2f32388, stattargets=0x0, reloptions=0, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false, constraintId=0x7ffd3ce876f4) at ../../../../../home/andres/src/postgresql/src/backend/catalog/index.c:1275


The reason we reopen over and over is that we close the file in mdexists():

	/*
	 * Close it first, to ensure that we notice if the fork has been unlinked
	 * since we opened it.  As an optimization, we can skip that in recovery,
	 * which already closes relations when dropping them.
	 */
	if (!InRecovery)
		mdclose(reln, forknum);

We call smgrexists as part of this code:

static Buffer
fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
...
	/*
	 * If we haven't cached the size of the FSM yet, check it first.  Also
	 * recheck if the requested block seems to be past end, since our cached
	 * value might be stale.  (We send smgr inval messages on truncation, but
	 * not on extension.)
	 */
	if (reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber ||
		blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM])
	{
		/* Invalidate the cache so smgrnblocks asks the kernel. */
		reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
		if (smgrexists(reln, FSM_FORKNUM))
			smgrnblocks(reln, FSM_FORKNUM);
		else
			reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
	}

Because we set the cached size to 0 if the fork doesn't exist, we'll reenter
during the next call, and then do the same thing again.
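
(To make that loop concrete: a standalone toy model of the caching behaviour
just described - emphatically not postgres code; the "known missing" flag only
sketches one possible direction, not a proposed fix:)

#include <stdbool.h>
#include <stdio.h>

#define INVALID_BLOCK ((unsigned) -1)

static unsigned cached_nblocks = INVALID_BLOCK;
static bool fork_known_missing = false;
static int open_calls = 0;

/* stand-in for smgrexists()/mdexists(): each call re-opens the file */
static bool
fork_exists(void)
{
	open_calls++;
	return false;				/* pretend the FSM fork was never created */
}

static void
readbuf(unsigned blkno, bool use_missing_flag)
{
	if (use_missing_flag && fork_known_missing)
		return;					/* remembered: nothing to read */

	if (cached_nblocks == INVALID_BLOCK || blkno >= cached_nblocks)
	{
		cached_nblocks = INVALID_BLOCK;
		if (fork_exists())
			cached_nblocks = 42;	/* would be smgrnblocks() */
		else
		{
			cached_nblocks = 0;	/* 0 means blkno 0 is "past end" next time too */
			fork_known_missing = true;
		}
	}
}

int
main(void)
{
	for (int i = 0; i < 5; i++)
		readbuf(0, false);
	printf("without flag: %d opens\n", open_calls);	/* 5 */

	open_calls = 0;
	cached_nblocks = INVALID_BLOCK;
	fork_known_missing = false;
	for (int i = 0; i < 5; i++)
		readbuf(0, true);
	printf("with flag:    %d opens\n", open_calls);	/* 1 */
	return 0;
}

Run as-is, the first pass re-opens on every lookup, the second only once.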


I don't think this is a huge performance issue or anything, but somehow it
seems indicative of something being "wrong".

It seems likely we encounter this issue not just with gist, but I haven't
checked yet.

Greetings,

Andres Freund




Re: race condition when writing pg_control

2024-05-16 Thread Andres Freund
Hi,

On 2024-05-16 15:01:31 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2024-05-16 14:50:50 -0400, Tom Lane wrote:
> >> The intention was certainly always that it be atomic.  If it isn't
> >> we have got *big* trouble.
> 
> > We unfortunately do *know* that on several systems e.g. basebackup can read 
> > a
> > partially written control file, while the control file is being
> > updated.
> 
> Yeah, but can't we just retry that if we get a bad checksum?

Retry what/where precisely? We can avoid the issue in basebackup.c by taking
ControlFileLock at the right moment - but that doesn't address
pg_start/stop_backup based backups. Hence the patch in the referenced thread
moves to replacing the control file via an atomic rename if there are base
backups ongoing.
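
(For illustration, a minimal POSIX sketch of the atomic-rename idea - not the
actual patch from the referenced thread; the file name and fixed 8kB size are
just placeholders:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
replace_control_file(const char *path, const void *buf, size_t len)
{
	char		tmp[1024];
	int			fd;

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);

	/* write the new contents to a temporary file and make them durable */
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
	{
		perror("write temp control file");
		exit(1);
	}
	close(fd);

	/*
	 * rename() atomically swaps the new file in: a concurrent reader sees
	 * either the old or the new contents, never a half-written mix.
	 */
	if (rename(tmp, path) != 0)
	{
		perror("rename");
		exit(1);
	}

	/* fsync the containing directory so the rename itself is durable */
	fd = open(".", O_RDONLY);
	if (fd >= 0)
	{
		fsync(fd);
		close(fd);
	}
}

int
main(void)
{
	char		buf[8192];

	memset(buf, 0x7f, sizeof(buf));
	replace_control_file("pg_control.example", buf, sizeof(buf));
	return 0;
}

A reader that open()s the file concurrently sees either the old or the new
version in full, which is the property external backup tools need.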


> What had better be atomic is the write to disk.

That is still true to my knowledge.


> Systems that can't manage POSIX semantics for concurrent reads and writes
> are annoying, but not fatal ...

I think part of the issue is that people don't agree on what POSIX says about
a read that's concurrent with a write... See e.g.
https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic

Greetings,

Andres Freund




Re: race condition when writing pg_control

2024-05-16 Thread Andres Freund
Hi,

On 2024-05-16 14:50:50 -0400, Tom Lane wrote:
> Nathan Bossart  writes:
> > I suspect it will be difficult to investigate this one too much further
> > unless we can track down a copy of the control file with the bad checksum.
> > Other than searching for any new code that isn't doing the appropriate
> > locking, maybe we could search the buildfarm for any other occurrences.  I
> > also see some threads concerning whether the way we are reading/writing
> > the control file is atomic.
> 
> The intention was certainly always that it be atomic.  If it isn't
> we have got *big* trouble.

We unfortunately do *know* that on several systems e.g. basebackup can read a
partially written control file, while the control file is being
updated. Thomas addressed this partially for frontend code, but not yet for
backend code. See
https://postgr.es/m/CA%2BhUKGLhLGCV67NuTiE%3Detdcw5ChMkYgpgFsa9PtrXm-984FYA%40mail.gmail.com

Greetings,

Andres Freund




Re: First draft of PG 17 release notes

2024-05-15 Thread Andres Freund
Hi,

On 2024-05-15 10:38:20 +0200, Alvaro Herrera wrote:
> I disagree with this.  IMO the impact of the Sawada/Naylor change is
> likely to be enormous for people with large tables and large numbers of
> tuples to clean up (I know we've had a number of customers in this
> situation, I can't imagine any Postgres service provider that doesn't).
> The fact that maintenance_work_mem is no longer capped at 1GB is very
> important and I think we should mention that explicitly in the release
> notes, as setting it higher could make a big difference in vacuum run
> times.

+many.

We're having this debate every release. I think the ongoing reticence to note
performance improvements in the release notes is hurting Postgres.

For one, performance improvements are one of the prime reasons users
upgrade. Without them being noted anywhere more dense than the commit log,
it's very hard to figure out what improved for users. A halfway widely
applicable performance improvement is far more impactful than many of the
feature changes we do list in the release notes.

For another, it's also very frustrating for developers who focus on
performance. The reticence to note that work, while noting other, far
smaller, things in the release notes, pretty much tells us that our work isn't
valued.

Greetings,

Andres Freund




Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

2024-05-15 Thread Andres Freund
Hi,

On 2024-05-15 11:14:14 +0200, Alvaro Herrera wrote:
> On 2024-May-15, Bharath Rupireddy wrote:
> 
> > It looks like with the use of the new multi insert table access method
> > (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].
> 
> Where does this acronym "TAM" come from for "table access method"?  I
> find it thoroughly horrible and wish we didn't use it.  What's wrong
> with using "table AM"?  It's not that much longer, much clearer and
> reuses our well-established acronym AM.

Strongly agreed. I don't know why I dislike TAM so much though.

Greetings,

Andres Freund




Re: Adding the extension name to EData / log_line_prefix

2024-05-15 Thread Andres Freund
Hi,

On 2024-05-15 13:45:30 -0400, Tom Lane wrote:
> There is one advantage over my suggestion of changing PG_MODULE_MAGIC:
> if we tell people to write
> 
>PG_MODULE_MAGIC;
>#undef TEXTDOMAIN
>#define TEXTDOMAIN PG_TEXTDOMAIN("hstore")
> 
> then that's 100% backwards compatible and they don't need any
> version-testing ifdef's.
> 
> I still think that the kind of infrastructure Andres proposes
> is way overkill compared to the value, plus it's almost certainly
> going to have a bunch of platform-specific problems to solve.

Maybe I'm missing something here. Even adding those two lines to the extensions
in core and contrib is going to end up being more lines than what I proposed?

What portability issues do you foresee? We already look up the same symbol in
all the shared libraries ("Pg_magic_func"), so we know that we can deal with
duplicate function names. Are you thinking that somehow we'd end up with
symbol interposition or something?
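
(A toy illustration of that per-library lookup - standalone code with made-up
library paths, not what dfmgr.c actually does:)

#include <dlfcn.h>
#include <stdio.h>

int
main(void)
{
	/* hypothetical extension libraries, purely for illustration */
	const char *libs[] = {"./hstore.so", "./cube.so"};

	for (size_t i = 0; i < sizeof(libs) / sizeof(libs[0]); i++)
	{
		void	   *handle = dlopen(libs[i], RTLD_NOW | RTLD_LOCAL);

		if (handle == NULL)
		{
			fprintf(stderr, "dlopen(%s): %s\n", libs[i], dlerror());
			continue;
		}

		/*
		 * Each handle resolves its own copy of the symbol, so identically
		 * named symbols in different libraries don't clash.
		 */
		printf("%s: Pg_magic_func at %p\n", libs[i],
			   dlsym(handle, "Pg_magic_func"));
		dlclose(handle);
	}
	return 0;
}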

Greetings,

Andres Freund




Re: Adding the extension name to EData / log_line_prefix

2024-05-15 Thread Andres Freund
Hi,

On 2024-05-15 12:54:45 -0400, Chapman Flack wrote:
> On 05/15/24 11:50, Tom Lane wrote:
> > Hmm, cute idea, but it'd only help for extensions that are
> > NLS-enabled.  Which I bet is a tiny fraction of the population.
> > So far as I can find, we don't even document how to set up
> > TEXTDOMAIN for an extension --- you have to cargo-cult the
> 
> But I'd bet, within the fraction of the population that does use it,
> it is already a short string that looks a whole lot like the name
> of the extension. So maybe enhancing the documentation and making it
> easy to set up would achieve much of the objective here.

The likely outcome would IMO be that some extensions will have the data,
others not. Whereas inferring the information from our side will give you
something reliable.

But I also just don't think it's something that architecturally fits together
that well. If we either had TEXTDOMAIN reliably set across extensions or it'd
architecturally be pretty, I'd go for it, but imo it's neither.

Greetings,

Andres Freund




Re: Adding the extension name to EData / log_line_prefix

2024-05-13 Thread Andres Freund
Hi,

On 2024-05-13 19:11:53 -0400, Tom Lane wrote:
> The mechanism that Andres describes for sourcing the name seems a bit
> overcomplex though.  Why not just allow/require each extension to
> specify its name as a constant string?  We could force the matter by
> redefining PG_MODULE_MAGIC as taking an argument:
>   PG_MODULE_MAGIC("hstore");

Mostly because it seemed somewhat sad to require every extension to have
version-specific ifdefs around that, particularly because it's not hard for us
to infer.

I think there might be other use cases for the backend to provide "extension
scoped" information, FWIW. Even just providing the full path to the extension
library could be useful.

Greetings,

Andres Freund




Re: Adding the extension name to EData / log_line_prefix

2024-05-13 Thread Andres Freund
Hi,

On 2024-05-13 19:25:11 -0300, Fabrízio de Royes Mello wrote:
> On Mon, May 13, 2024 at 5:51 PM Andres Freund  wrote:
> > It's not entirely trivial to provide errfinish() with a parameter indicating
> > the extension, but it's doable:
> >
> > 1) Have PG_MODULE_MAGIC also define a new variable for the extension name,
> >    empty at that point
> >
> > 2) In internal_load_library(), look up that new variable, and fill it with a,
> >    mangled, libname.
> >
> > 4) in elog.h, define a new macro depending on BUILDING_DLL (if it is set,
> >    we're in the server, otherwise an extension). In the backend itself, define
> >    it to NULL, otherwise to the variable created by PG_MODULE_MAGIC.
> >
> > 5) In elog/ereport/errsave/... pass this new variable to
> >    errfinish/errsave_finish.
> >
> 
> Then every extension should define their own Pg_extension_filename, right?

It'd be automatically set by postgres when loading libraries.


> > I've attached a *very rough* prototype of this idea. My goal at this stage was
> > just to show that it's possible, not for the code to be in a reviewable state.
> >
> >
> > Here's e.g. what this produces with log_line_prefix='%m [%E] '
> >
> > 2024-05-13 13:50:17.518 PDT [postgres] LOG:  database system is ready to accept connections
> > 2024-05-13 13:50:19.138 PDT [cube] ERROR:  invalid input syntax for cube at character 13
> > 2024-05-13 13:50:19.138 PDT [cube] DETAIL:  syntax error at or near "f"
> > 2024-05-13 13:50:19.138 PDT [cube] STATEMENT:  SELECT cube('foo');
> >
> > 2024-05-13 13:43:07.484 PDT [postgres] LOG:  database system is ready to accept connections
> > 2024-05-13 13:43:11.699 PDT [hstore] ERROR:  syntax error in hstore: unexpected end of string at character 15
> > 2024-05-13 13:43:11.699 PDT [hstore] STATEMENT:  SELECT hstore('foo');
> >
> >
> 
> Was not able to build your patch by simply:

Oh, turns out it only builds with meson right now.  I forgot that, with
autoconf, for some unknown reason, we only set BUILDING_DLL on some OSs.

I attached a crude patch changing that.


> > It's worth pointing out that this, quite fundamentally, can only work when the
> > log message is triggered directly by the extension. If the extension code
> > calls some postgres function and that function then errors out, it'll be seen
> > as being part of postgres.
> >
> > But I think that's ok - they're going to be properly errcode-ified etc.
> >
> 
> Hmmm, depending on the extension it can extensively call/use postgres code,
> so it would be nice if we could differentiate whether the code is called from
> Postgres itself or from an extension.

I think that's not realistically possible. It's also very fuzzy what that'd
mean. If there's a planner hook and then the query executes normally, what do
you report for an execution time error? And even the simpler case - should use
of pg_stat_statements cause everything within to be logged as a
pg_stat_statement message?

I think the best we can do is to actually say where the error is directly
triggered from.

Greetings,

Andres Freund
From 1c59c465f2d359e8c47cf91d1ea458ea3b64ec84 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 13 May 2024 13:47:41 -0700
Subject: [PATCH v2 1/2] WIP: Track shared library in which elog/ereport is
 called

---
 src/include/fmgr.h |  1 +
 src/include/utils/elog.h   | 18 +-
 src/backend/utils/error/elog.c | 33 -
 src/backend/utils/fmgr/dfmgr.c | 30 ++
 4 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index ccb4070a251..3c7cfd7fee9 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -504,6 +504,7 @@ PG_MAGIC_FUNCTION_NAME(void) \
 	static const Pg_magic_struct Pg_magic_data = PG_MODULE_MAGIC_DATA; \
 	return &Pg_magic_data; \
 } \
+PGDLLEXPORT const char *Pg_extension_filename = NULL; \
 extern int no_such_variable
 
 
diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 054dd2bf62f..5e57f01823d 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -137,6 +137,13 @@ struct Node;
  * prevents gcc from making the unreachability deduction at optlevel -O0.
  *--
  */
+#ifdef BUILDING_DLL
+#define ELOG_EXTNAME NULL
+#else
+extern PGDLLEXPORT const char *Pg_extension_filename;
+#define ELOG_EXTNAME Pg_extension_filename
+#endif
+
 #ifdef HAVE__BUILTIN_CONSTANT_P
 #define ereport_domain(elevel, domain, ...)	\
 	do { \
@@ -144,7 +151,7 @@ struct Node;
 		if (__

<    1   2   3   4   5   6   7   8   9   10   >