Re: failures in t/031_recovery_conflict.pl on CI

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
> on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> interesting, because I ran it there dozens if not hundreds of times before
> commit, with - I think - only cosmetic changes.

Scratch that part - I found an instance of the freebsd failure earlier, just
didn't notice because that run failed for other reasons as well. So this might
just have uncovered an older bug around recovery conflict handling,
potentially platform dependent.

I guess I'll try to reproduce it on freebsd...

Greetings,

Andres Freund




failures in t/031_recovery_conflict.pl on CI

2022-04-08 Thread Andres Freund
Hi,

on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
interesting, because I ran it there dozens if not hundreds of times before
commit, with - I think - only cosmetic changes.

I've reproduced it in a private branch, with more logging. And the results are
sure interesting.

https://cirrus-ci.com/task/6448492666159104
https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log

The primary is waiting for 0/343A000 to be applied, which requires a recovery
conflict to be detected and resolved. On the standby there's the following
sequence (some omitted):

prerequisite for recovery conflict:
2022-04-09 04:05:31.292 UTC [35071][client backend] 
[031_recovery_conflict.pl][2/2:0] LOG:  statement: BEGIN;
2022-04-09 04:05:31.292 UTC [35071][client backend] 
[031_recovery_conflict.pl][2/2:0] LOG:  statement: DECLARE 
test_recovery_conflict_cursor CURSOR FOR SELECT b FROM 
test_recovery_conflict_table1;
2022-04-09 04:05:31.293 UTC [35071][client backend] 
[031_recovery_conflict.pl][2/2:0] LOG:  statement: FETCH FORWARD FROM 
test_recovery_conflict_cursor;

detecting the conflict:
2022-04-09 04:05:31.382 UTC [35038][startup] LOG:  recovery still waiting after 
28.821 ms: recovery conflict on buffer pin
2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT:  WAL redo at 0/3432800 
for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 
1663/16385/16386, blk 0

and then nothing until the timeout:
2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG:  received immediate 
shutdown request
2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG:  sending signal 3 to 
process 35071
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  server process (PID 
35071) exited with exit code 2

Afaics that has to mean something is broken around sending, receiving or
processing of recovery conflict interrupts.


All the failures so far were on freebsd, from what I can see. There were other
failures in other tests, but I think those were due to things that have since been
reverted or fixed.


Except for not previously triggering while the shmstats patch was in
development, it's hard to tell whether it's a regression or just a
longstanding bug - we never had tests for recovery conflicts...


I don't really see how recovery prefetching could play a role here, clearly
we've been trying to replay the record. So we're elsewhere...

Greetings,

Andres Freund

https://cirrus-ci.com/github/postgres/postgres/master




Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Julien Rouhaud
Hi,

On Fri, Apr 08, 2022 at 09:39:18AM -0400, Stephen Frost wrote:
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > The addition to pg_stat_statements I pushed a short while ago would help
> > with that. But I think having a warning like this would also be useful. As
> > a stop-gap measure, yes, but we really don't know when we will have an
> > improved costing model for it. I hope you're right and that we can have it
> > by 16, and then I will definitely advocate for removing the warning again
> > if it works.
>
> Having this in pg_stat_statements is certainly helpful but having a
> warning also is.  I don't think we have to address this in only one way.
> A lot faster to flip this guc and then look in the logs on a busy system
> than to install pg_stat_statements, restart the cluster once you get
> permission to do so, and then query it.

+1, especially if you otherwise don't really need or want to have
pg_stat_statements enabled, as it's far from being free.  Sure you could enable
it by default with pg_stat_statements.track = none, but that seems a lot more
troublesome than just dynamically enabling a GUC, possibly for a short time
and/or for a specific database/role.




Re: Expose JIT counters/timing in pg_stat_statements

2022-04-08 Thread Julien Rouhaud
Hi,

On Sat, Apr 09, 2022 at 01:51:21AM +, Shinoda, Noriyoshi (PN Japan FSIP) 
wrote:
> Hi,
> thank you for the great features.
>
> The attached small patch changes the data type in the document.
> The following columns are actually double precision but bigint in the docs.
> jit_generation_time
> jit_inlining_time
> jit_optimization_time
> jit_emission_count

Indeed!  The patch looks good to me, I didn't find any other discrepancy.




Commitfest wrapup

2022-04-08 Thread Greg Stark
These remaining CF entries look like they're bugs that are maybe Open
Issues for release?

* fix spinlock contention in LogwrtResult

* Avoid erroring out when unable to remove or parse logical rewrite
files to save checkpoint work

* Add checkpoint and redo LSN to LogCheckpointEnd log message

* standby recovery fails when re-replaying due to missing directory
which was removed in previous replay.

* Logical replication failure "ERROR: could not map filenode
"base/13237/442428" to relation OID" with catalog modifying txns

* fix psql pattern handling

* Possible fails in pg_stat_statements test

* pg_receivewal fail to streams when the partial file to write is not
fully initialized present in the wal receiver directory

* Fix pg_rewind race condition just after promotion


Was the plan to commit this after feature freeze?

* pg_stat_statements: Track statement entry timestamp



A couple minor documentation, testing, and diagnostics patches that
may be committable even after feature freeze?

* Improve role attributes docs

* Reloption tests improvement. Test resetting illegal option that was
actually set for some reason

* Make message at end-of-recovery less scary

* jit_warn_above_fraction parameter


And then there are the more devilish cases. I think some of these
patches are Rejected or Returned with Feedback but I'm not certain.
Some of them have split into multiple discussions or are partly
committed but still related work remains. Any opinions on what to do
with these?

* Simplify some RI checks to reduce SPI overhead

* Map WAL segment files on PMEM as WAL buffers

* Support custom authentication methods using hooks

* Implement INSERT SET syntax

* Logical insert/update/delete WAL records for custom table AMs


-- 
greg




cfbot requests

2022-04-08 Thread Justin Pryzby
I mentioned most/all of these ideas for cfbot at some point.  I'm writing them
now so other people know about them and they're in one place.

 - Keep the original patch series and commit messages, rather than squishing
them into a single commit with cfbot's own commit messages.  Maybe append an
empty commit with cfbot's message, and include a parsable "base-branch: NNN"
commit hash.  This supports some CI ideas like making HTML available for
patches touching doc/ (which needs to be relative to some other commit to show
only the changes rather than every *.html).  And code coverage report for
changed files, which has the same requirement.

That *also* allows directly reviewing the original patch series with a branch
maintained by cfbot, preserving the patch set, with commit messages.  See also:
https://www.postgresql.org/message-id/flat/CAKU4AWoU-P1zPS5hmiXpto6WGLOqk27VgCrxSKE2mgX%3DfypV6Q%40mail.gmail.com

Alternate idea: submit the patch to cirrus as a PR which makes the "base
branch" available to cirrus as an environment variable (note that PRs also
change a couple of other cirrus behaviors).

 - I think cfbot already keeps track of historic CI build results (pass/fail
and link to cirrus).  But I don't think cfbot exposes this yet.  I know cirrus
can show history for a branch, but I can never find it, and their history is
limited.  This would be like the buildfarm pages, ordered by time: one showing
all results, one showing all failures, and one showing all results for a given
patch.  You could also consider retrieving the cirrus logs themselves, to allow
setting our own retention interval (we could ask cirrus if they'd want to allow
setting more aggressive expiration for logs/artifacts).

 - HTML: sort by CF ID rather than alpha sort.  Right now, commitfest entries
beginning with a capital letter sort first, which at least one person seems to
have discovered.

 - HTML: add a "queued for CI" page showing the next patches to be submitted to
cirrus.  These pages might allow queueing a re-run, too.

 - HTML: show "target version" and "committer" (maybe committer name would be
shown in the existing list of names, but with another style applied).  This 
helps
to distinguish between patches which someone optimistically said was RFC and a
patch which a committer intends to commit, which ought to be pretty visible so
it's not lost in the mailing list and a surprise to anyone.

 - index on CF ID without CFNUM - reasons that you mentioned that I can't
remember (like maybe cirrus history and rebuilds at end of each CF).




Re: PostgreSQL commitfest: 2022-09-01

2022-04-08 Thread Justin Pryzby
On Fri, Apr 08, 2022 at 12:55:04PM +, webmas...@postgresql.org wrote:
> The following patches that you follow, directly or indirectly,
> have received updates in the PostgreSQL commitfest app:
> 
> 
> Function to log backtrace of postgres processes
> https://commitfest.postgresql.org/38/2863/
> 
> * New status: Needs review (stark)
> * Closed in commitfest 2022-03 with status: Moved to next CF (stark)
> 
> 
> 
> Add --schema and --exclude-schema options to vacuumdb
> https://commitfest.postgresql.org/39/3587/
> 
> * New status: Needs review (stark)
> * Closed in commitfest 2022-07 with status: Moved to next CF (stark)

The 2nd patch was already in July's CF, and it looks like you accidentally
moved it to July+1.  AFAIK this requires a DBA to fix.




RE: Expose JIT counters/timing in pg_stat_statements

2022-04-08 Thread Shinoda, Noriyoshi (PN Japan FSIP)
Hi,
thank you for the great features.

The attached small patch changes the data type in the document.
The following columns are actually double precision but bigint in the docs.
jit_generation_time
jit_inlining_time
jit_optimization_time
jit_emission_count

Regards,
Noriyoshi Shinoda

From: Magnus Hagander 
Sent: Friday, April 8, 2022 8:47 PM
To: Julien Rouhaud 
Cc: PostgreSQL Developers 
Subject: Re: Expose JIT counters/timing in pg_stat_statements



On Tue, Mar 8, 2022 at 4:08 AM Julien Rouhaud <rjuju...@gmail.com> wrote:
On Mon, Mar 07, 2022 at 01:40:34PM +0100, Magnus Hagander wrote:
>
> I wonder if there might be an interesting middle ground, or if that is
> making it too much. That is, we could have an
> Option 3:
> jit_count
> total_jit_time - for sum of functions+inlining+optimization+emission time
> min_jit_time - for sum of functions+inlining+optimization+emission time
> max_jit_time - for sum of functions+inlining+optimization+emission time
> mean_jit_time - for sum of functions+inlining+optimization+emission time
> stddev_jit_time - for sum of functions+inlining+optimization+emission time
> jit_functions
> jit_generation_time
> jit_inlining_count
> jit_inlining_time
> jit_optimization_count
> jit_optimization_time
> jit_emission_count
> jit_emission_time
>
> That is, we'd get the more detailed timings across the total time, but
> not on the details. But that might be overkill.

I also thought about it but it seems overkill.  pg_stat_statements view is
already very big, and I think that the JIT time should be somewhat stable, at
least compared to how much a query execution time can vary depending on the
parameters.  This approach would also be a bit useless if you change the
costing of underlying JIT operation.

> But -- here's an updated patched based on Option 2.

Thanks!

Code-wide, the patch looks good.  For the doc, it seems that you documented
jit_inlining_count three times rather than documenting jit_optimization_count
and jit_emission_count.

Oops, thanks and fixed.


I don't think we can add tests there, and having a test for every new counter
being >= 0 seems entirely useless, however there should be a new test added for
the "oldextversions" test to make sure that there's no issue with old SQL / new
shlib compatibility.  And looking at it I see that it was already missed for
version 1.9 :(

Indeed. Fixed here.

Michael had already applied a patch that took us to 1.10 and added that test, 
so I've just updated it here. I don't think we normally bump the version twice 
in the same day, so I just merged the SQL script changes as well.

PFA a "final" version for the CI to run.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


pg_stat_statements_jit_doc_v1.diff
Description: pg_stat_statements_jit_doc_v1.diff


Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Justin Pryzby
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote:
> > I think there might be another problem.  The man page for rename() seems to
> > indicate that overwriting an existing file also introduces a window where
> > the old and new path are hard links to the same file.  This isn't a problem
> > for the WAL files because we should never be overwriting an existing one,
> > but I wonder if it's a problem for other code paths.  My guess is that many
> > code paths that overwrite an existing file are first writing changes to a
> > temporary file before atomically replacing the original.  Those paths are
> > likely okay, too, as you can usually just discard any existing temporary
> > files.
> 
> I wonder if this is really true. I thought rename() was supposed to be atomic.

Looks like it's atomic in that it's not like cp + rm, but not atomic the other
way you want.

|   If newpath already exists, it will be atomically replaced, so that there is
|   no point at which another process attempting to access newpath will find it
|   missing.  However, there will probably be a window in which both oldpath
|   and newpath refer to the file being renamed.
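
For illustration (not from this thread, and not PostgreSQL's durable_rename()),
here is a minimal standalone sketch of the usual "write a temporary file, then
rename() it into place" idiom being discussed, with error handling deliberately
simplified:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Atomically replace "path" with "data": readers always see either the old or
 * the new contents.  Per the man page quoted above, there may still be a brief
 * window in which both the temporary name and "path" refer to the same file. */
static void
replace_file_atomically(const char *path, const char *data, size_t len)
{
	char		tmppath[1024];
	int			fd;

	snprintf(tmppath, sizeof(tmppath), "%s.tmp.%d", path, (int) getpid());

	fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0 ||
		write(fd, data, len) != (ssize_t) len ||
		fsync(fd) != 0 ||
		close(fd) != 0)
	{
		perror(tmppath);
		exit(1);
	}

	if (rename(tmppath, path) != 0)
	{
		perror("rename");
		exit(1);
	}
}

Any leftover "*.tmp.*" file from a crash can simply be discarded, which is why
code paths built on this pattern are largely unaffected by the hard-link window.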




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart  wrote:
> On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> > I see that durable_rename_excl() has the following comment: "Similar
> > to durable_rename(), except that this routine tries (but does not
> > guarantee) not to overwrite the target file." If those are the desired
> > semantics, we could achieve them more simply and more safely by just
> > trying to stat() the target file and then, if it's not found, call
> > durable_rename(). I think that would be a heck of a lot safer than
> > what this function is doing right now.
>
> IIUC it actually does guarantee that you won't overwrite the target file
> when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
> all.  Using stat() before rename() would therefore weaken this check for
> systems with working link(), but it'd probably strengthen it for systems
> without a working link().

Sure, but a guarantee that happens on only some systems isn't worth
much. And, if it comes at the price of potentially having multiple
hard links to the same file in obscure situations, that seems like it
could easily cause more problems than this whole scheme can ever hope
to solve.
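
As a hedged sketch of the stat()-then-durable_rename() idea from the quoted text
above (illustrative backend-style code only, not the actual patch; the function
name is hypothetical, and it keeps the inherent race between the stat() and the
rename()):

#include "postgres.h"

#include <sys/stat.h>

#include "storage/fd.h"

/* Hypothetical replacement for durable_rename_excl(): try not to clobber an
 * existing target, then fall through to a plain durable_rename().  The check
 * is only advisory; another process could still create newfile between the
 * stat() and the rename(). */
static int
durable_rename_no_overwrite(const char *oldfile, const char *newfile, int elevel)
{
	struct stat st;

	if (stat(newfile, &st) == 0)
	{
		ereport(elevel,
				(errmsg("file \"%s\" already exists, not overwriting", newfile)));
		return -1;
	}

	return durable_rename(oldfile, newfile, elevel);
}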

> I think there might be another problem.  The man page for rename() seems to
> indicate that overwriting an existing file also introduces a window where
> the old and new path are hard links to the same file.  This isn't a problem
> for the WAL files because we should never be overwriting an existing one,
> but I wonder if it's a problem for other code paths.  My guess is that many
> code paths that overwrite an existing file are first writing changes to a
> temporary file before atomically replacing the original.  Those paths are
> likely okay, too, as you can usually just discard any existing temporary
> files.

I wonder if this is really true. I thought rename() was supposed to be atomic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.

Adding

$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
$node_standby2->start;

makes the tests pass here.


What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery.  To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive.  As this requires at least two timeline
# switches, promote the existing standby first.  Then create a second
# standby based on the promoted one.  Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
has_restoring => 1);

It's created from the original primary, not the first standby, as the
description says...

Both nodes get promoted independently, in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  INSERT @ 
0/458:  - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 
17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  xlog flush 
request 0/458; write 0/400; flush 0/400

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  INSERT @ 
0/458:  - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 
17:23:43.307443-07
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  xlog flush 
request 0/458; write 0/400; flush 0/400

except that standby2 can't choose tli 2 because it finds it used.

Sure looks like something is funky with that test.


But I think there's also something funky in the prefetching logic. I think it
may attempt restoring during prefetching somehow, even though there's code
that appears to try to prevent that?

on standby2 I can see replay progress like the following:
2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024488; 
LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 
4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG:  record known 
xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 
0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG:  executing 
restore command "cp 
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00010004"
 "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not 
restore file "00010004" from archive: child process exited with 
exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not open 
file "pg_wal/00010004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/30244C8; 
LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 
4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG:  record known 
xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 
0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG:  executing 
restore command "cp 
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00010004"
 "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not 
restore file "00010004" from archive: child process exited with 
exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not open 
file "pg_wal/00010004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024508; 
LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 
4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG:  record known 
xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 
0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG:  executing 
restore command "cp 
"/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00010004"
 "pg_wal/RECOVERYXLOG""

note 

Re: Mingw task for Cirrus CI

2022-04-08 Thread Justin Pryzby
On Thu, Apr 07, 2022 at 10:10:21AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
> > On 3/30/22 20:26, Andres Freund wrote:
> > > Could you try using dash to invoke configure here, and whether it makes 
> > > configure faster?
> > I got weird failures re libxml/parser.h when I tried with dash. See
> >  (It would be nice if we
> > could see config.log on failure.)
> 
> Since dash won't help us to get the build time down sufficiently, and the
> tests don't pass without a separate build tree, I looked at what makes
> config/prep_buildtree so slow.
> 
> It's largely just bad code. The slowest part are spawning one expr and mkdir
> -p for each directory. One 'cmp' for each makefile doesn't help either.
> 
> The expr can be replaced with
>   subdir=${item#$sourcetree}
> that's afaics posix syntax ([1]), not bash.
> 
> Spawning one mkdir for each directory can be replaced by a single mkdir
> invocation with all the directories. On my linux workstation that gets the
> time for the first loop down from 1005ms to 38ms, really.

Even better?

(cd "$sourcetree" && find . -print |grep -E '/Makefile$|/GNUmakefile$' |grep -v 
"$sourcetree/doc/src/sgml/images/" |xargs tar c) |
(cd "$buildtree" && tar x)




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> After seeing skink's results, I tried running that test under valgrind
> here, and it fails just like that every time.  skink's history allows
> us to bound the failure introduction between 79b716cfb7 and
> d7ab2a9a3c, which I think makes it just about certain that it was
> 5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
> not bisected to be 100% sure.

I've tested it, it's 5dc0418fab that makes the difference. I reduced the cycle
time by making initdb not go through valgrind, but have normal postgres
instances go through it.


> On the whole, I'm not sure that the WAL prefetch logic is noticeably
> more stable than when we booted it out last year :-(.

IDK. Last year's issues seem to have largely been caused by a flaky
machine. And a bug, if it's that, in some archiving corner case that's not
normally reached during tests...

Greetings,

Andres Freund




Re: Mingw task for Cirrus CI

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 17:04:34 +0200, Alvaro Herrera wrote:
> > Since dash won't help us to get the build time down sufficiently, and the
> > tests don't pass without a separate build tree, I looked at what makes
> > config/prep_buildtree so slow.
>
> Maybe we can replace prep_buildtree with a Perl script.  Surely that
> should be faster.

Currently building doesn't depend on perl :(

I think the improvements that I suggested are big enough that they're worth
doing on their own, particularly for windows, but also other OSs.


I just realized that the second find is pretty expensive compared to the
first.

time find "$sourcetree" -type d \( \( -name CVS -prune \) -o \( -name .git 
-prune \) -o -print \) | grep -v "$sourcetree/doc/src/sgml/\+" > /dev/null
real0m0.019s
user0m0.008s
sys 0m0.017s

second:
time find "$sourcetree" -name Makefile -print -o -name GNUmakefile -print | 
grep -v "$sourcetree/doc/src/sgml/images/" > /dev/null

real0m0.118s
user0m0.071s
sys 0m0.053s

I think we could just obsolete the second find, by checking for the existence
of Makefile / GNUmakefile in the first loop...


The invocation of ln -s is quite measurable - looks like it's mostly the
process startup overhead (on linux, at least). Doing a ln --version > /dev/null
each iteration takes about the same time as actually creating the symlinks.
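
As a purely illustrative aside (not something from this thread), the
process-startup effect is easy to demonstrate with a tiny C micro-benchmark
comparing the symlink(2) syscall against spawning "ln -s" through a shell; the
link names and target used here are arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double
elapsed(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
	struct timespec t0, t1, t2;
	char		name[64], cmd[128];

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < 100; i++)
	{
		snprintf(name, sizeof(name), "link_syscall_%d", i);
		symlink("/etc/hostname", name);	/* direct syscall */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (int i = 0; i < 100; i++)
	{
		snprintf(cmd, sizeof(cmd), "ln -s /etc/hostname link_exec_%d", i);
		system(cmd);					/* fork + exec per link */
	}
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("symlink(2): %.3fs, ln -s via shell: %.3fs\n",
		   elapsed(t0, t1), elapsed(t1, t2));
	return 0;
}

system() adds a shell on top of ln, so the second number overstates things a
bit, but the point stands: the per-process startup cost dwarfs the syscall.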

Greetings,

Andres Freund




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 1:29 PM Robert Haas  wrote:
> Hmm. I wonder if we could teach the system to figure out which of
> those things is happening. In the case that I'm worried about, when
> we're considering growing the line pointer array, either the line
> pointers will be dead or the line pointers will be used but the tuples
> to which they point will be dead. In the case you describe here, there
> should be very few dead tuples or line pointers in the page. Maybe
> when the number of line pointers starts to get big, we refuse to add
> more without checking the number of dead tuples and dead line pointers
> and verifying that those numbers are still small. Or, uh, something.

It seems like the central idea is that we think in terms of "versions
per logical row", even in low level code that traditionally hasn't
made those kinds of distinctions.

Ideally we could structure pruning a little bit more like a B-Tree
page split, where there is an explicit "incoming tuple" that won't fit
(without splitting the page, or maybe doing some kind of deletion). If
the so-called incoming tuple that we'd rather like to fit on the page
is an insert of an unrelated row, don't allow it (don't prune, give
up). But if it's an update (especially a hot update), be much more
permissive about allowing it, and/or going ahead with pruning in order
to make sure it happens.

I like Andres' idea of altering LP_REDIRECTs just to be able to use up
lower line pointers first. Or preserving a few extra LP_UNUSED items
on insert. Those seem promising to me.

> One fly in the ointment is that if we refuse to expand the line
> pointer array, we might extend the relation instead, which is another
> kind of bloat and thus not great. But if the line pointer array is
> somehow filling up with tons of dead tuples, we're going to have to
> extend the relation anyway. I suspect that in some circumstances it's
> better to just accept that outcome and hope that it leads to some
> pages becoming empty, thus allowing their line pointer arrays to be
> reset.

I agree. Sometimes the problem is that we don't cut our losses when we
should -- sometimes just accepting a limited downside is the right
thing to do. Like with the FSM; we diligently use every last scrap of
free space, without concern for the bigger picture. It's penny-wise,
pound-foolish.

-- 
Peter Geoghegan




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-08 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.
>> However, assuming it is real, I guess I wonder why we don't say
>> CHECKPOINT_FORCE in standby mode too.

> I guess it might partially be that restartpoints require a checkpoint to have
> happened on the primary. If we used FORCE, we'd have to wait till the next
> checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
> CHECKPOINT; before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time.  skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me
that indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command.  I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of
an available primary checkpoint, which looks to me like it could be
a pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(.  However, I also
wonder why it is that this test case wasn't occasionally failing already.

regards, tom lane

diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..1032d8a388 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -44,13 +44,14 @@ $node_standby->start;
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 
 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");
 
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
 


Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 2:06 PM Andres Freund  wrote:
> It's not hard to hit scenarios where pages are effectively unusable, because
> they have close to 291 dead items, without autovacuum triggering (or
> autovacuum just taking a while).

I think that this is mostly a problem with HOT-updates, and regular
updates to a lesser degree. Deletes seem less troublesome.

I find that it's useful to think in terms of the high watermark number
of versions required for a given logical row over time. It's probably
quite rare for most individual logical rows to truly require more than
2 or 3 versions per row at the same time, to serve queries. Even in
update-heavy tables. And without doing anything fancy with the
definition of HeapTupleSatisfiesVacuum(). There are important
exceptions, certainly, but overall I think that we're still not doing
good enough with these easier cases.

The high watermark number of versions is probably going to be
significantly greater than the typical number of versions for the same
row. So maybe we give up on keeping a row on its original heap block
today, all because of a once-off (or very rare) event where we needed
slightly more extra space for only a fraction of a second.

The tell-tale sign of these kinds of problems can sometimes be seen
with synthetic, rate-limited benchmarks. If it takes a very long time
for the problem to grow, but nothing about the workload really ever
changes, then that suggests problems that have this quality.  The
probability of any given logical row being moved to another heap block
is very low. And yet it is inevitable that many (even all) will, given
enough time, given enough opportunities to get unlucky.

> This has become a bit more pronounced with vacuum skipping index cleanup when
> there's "just a few" dead items - if all your updates concentrate in a small
> region, 2% of the whole relation size isn't actually that small.

The 2% threshold was chosen based on the observation that it was below
the effective threshold where autovacuum just won't ever launch
anything on a moderate sized table (unless you set
autovacuum_vacuum_scale_factor to something absurdly low). The real
problem is that IMV. That's why I think that we need to drive it based
primarily on page-level characteristics. While effectively ignoring
pages that are all-visible when deciding if enough bloat is present to
necessitate vacuuming.

> 1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
>reclaim them during pruning once they're dead. They don't leave behind a
>dead item that's unreclaimable until the next vacuum with an index cleanup
>pass.

I like the general direction here, but this particular idea doesn't
seem like a winner.

> 2) Arguably the OffsetNumber of a redirect target can be changed. It might
>break careless uses of WHERE ctid = ... though (which likely are already
>broken, just harder to hit).

That makes perfect sense to me, though.

> a) heap_page_prune_prune() should take the number of used items into account
>when deciding whether to prune. Right now we trigger hot pruning based on
>the number of items only if PageGetMaxOffsetNumber(page) >=
>MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
>used for a root tuple, we should trigger HOT pruning when it might lower
>which OffsetNumber get used.

Unsure about this.

> b) heap_page_prune_prune() should be triggered in more paths. E.g. when
>inserting / updating, we should prune if it allows us to avoid using a high
>OffsetNumber.

Unsure about this too.

I prototyped a design that gives individual backends soft ownership of
heap blocks that were recently allocated, and later prunes the heap
page when it fills [1]. Useful for aborted transactions, where it
preserves locality -- leaving aborted tuples behind makes their space
ultimately reused for unrelated inserts, which is bad. But eager
pruning allows the inserter to leave behind more or less pristine heap
pages, which don't need to be pruned later on.

> c) What if we left some percentage of ItemIds unused, when looking for the
>OffsetNumber of a new HOT row version? That'd make it more likely for
>non-HOT updates and inserts to fit onto the page, without permanently
>increasing the size of the line pointer array.

That sounds promising.

[1] 
https://postgr.es/m/cah2-wzm-vhveqyth8hlyyho2wdg8ecrm0upqjwjap6bovfe...@mail.gmail.com
--
Peter Geoghegan




Re: MERGE bug report

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-06, Richard Guo wrote:

> That's right. The varattno is set to zero for whole-row Var. And in this
> case these whole-row Vars are not included in the targetlist.
> 
> Attached is an attempt for the fix.

Wow, this is very interesting.  I was surprised that this patch was
necessary at all -- I mean, if wholerow refs don't work, then why do
references to any other columns work?  The answer is that parse_merge.c
is already setting up the subplan's targetlist by expanding all vars of
the source relation.  I then remembered than in Simon's (or Pavan's)
original coding, parse_merge.c had a hack to include a var with the
source's wholerow in that targetlist, which I had later removed ...

I eventually realized that there's no need for parse_merge.c to expand
the source rel at all, and indeed it's wasteful: we can just let
preprocess_targetlist include the vars that are referenced by either
quals or each action's targetlist instead.  That led me to the attached
patch, which is not commit-quality yet but it should show what I have in
mind.

I added a test query to tickle this problematic case.

Another point, not completely connected to this bug but appearing in the
same function, is that we have some redundant code: we can just let the
stanza for UPDATE/DELETE do the identity columns dance.  This saves a
few lines in the MERGE-specific stanza there, which was doing exactly
the same thing.  (There's a difference in the "inh" test, but I think
that was just outdated.)

I also discovered that the comment for fix_join_expr needed an update,
since it doesn't mention MERGE, and it does mention all other situations
in which it is used.  Added that too.


One more thing, about the comment regarding "aggregates, window functions and
placeholder vars": that was relevant and correct when only the qual of
each action was being handled (i.e., Richard's patch).  Now that we're
also handling the action's targetlist, I think I need to put the PVC
flags back.  But no tests broke, which probably means we also need some
additional test cases.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 6ea3505646..fa5969bbd5 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -2763,6 +2763,10 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *	  to-be-updated relation) alone. Correspondingly inner_itlist is to be
  *	  EXCLUDED elements, outer_itlist = NULL and acceptable_rel the target
  *	  relation.
+ * 4) MERGE.  In this case, references to the source relation are to be
+ *replaced with INNER_VAR references, and target Vars (the to-be-
+ *modified relation) are left alone. So inner_itlist is to be
+ *the source relation and acceptable_rel the target relatin.
  *
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 99ab3d7559..c1c3067365 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -107,14 +107,15 @@ preprocess_targetlist(PlannerInfo *root)
 		root->update_colnos = extract_update_targetlist_colnos(tlist);
 
 	/*
-	 * For non-inherited UPDATE/DELETE, register any junk column(s) needed to
-	 * allow the executor to identify the rows to be updated or deleted.  In
-	 * the inheritance case, we do nothing now, leaving this to be dealt with
-	 * when expand_inherited_rtentry() makes the leaf target relations.  (But
-	 * there might not be any leaf target relations, in which case we must do
-	 * this in distribute_row_identity_vars().)
+	 * For non-inherited UPDATE/DELETE/MERGE, register any junk column(s)
+	 * needed to allow the executor to identify the rows to be updated or
+	 * deleted.  In the inheritance case, we do nothing now, leaving this to
+	 * be dealt with when expand_inherited_rtentry() makes the leaf target
+	 * relations.  (But there might not be any leaf target relations, in which
+	 * case we must do this in distribute_row_identity_vars().)
 	 */
-	if ((command_type == CMD_UPDATE || command_type == CMD_DELETE) &&
+	if ((command_type == CMD_UPDATE || command_type == CMD_DELETE ||
+		 command_type == CMD_MERGE) &&
 		!target_rte->inh)
 	{
 		/* row-identity logic expects to add stuff to processed_tlist */
@@ -125,23 +126,15 @@ preprocess_targetlist(PlannerInfo *root)
 	}
 
 	/*
-	 * For MERGE we need to handle the target list for the target relation,
-	 * and also target list for each action (only INSERT/UPDATE matter).
+	 * For MERGE we also need to handle the target list for each INSERT and
+	 * UPDATE action separately.  In addition, we examine the qual of each
+	 * action and add any Vars there (other than those of the target rel) to
+	 * the subplan targetlist.
 	 */
 	if (command_type == CMD_MERGE)
 	{
 		

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 2:18 PM Andres Freund  wrote:
> It's 4 bytes per line pointer, right?

Yeah, it's 4 bytes in Postgres. Most other DB systems only need 2
bytes, which is implemented in exactly the way that you're imagining.

-- 
Peter Geoghegan




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 15:04:37 -0400, Robert Haas wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes.

It's 4 bytes per line pointer, right?

struct ItemIdData {
	unsigned int               lp_off:15;       /*     0: 0  4 */
	unsigned int               lp_flags:2;      /*     0:15  4 */
	unsigned int               lp_len:15;       /*     0:17  4 */

	/* size: 4, cachelines: 1, members: 3 */
	/* last cacheline: 4 bytes */
};

Or am I confusing myself somehow?


I do wish the length of the tuple weren't in ItemIdData, but part of the
tuple, so we'd not waste this space for dead items (I think it'd also simplify
more code than it'd complicate). But ...
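
Purely as an illustration of that idea (not a proposal, and not PostgreSQL's
on-disk format): if lp_len lived in the tuple header instead, a line pointer
could in principle shrink to something like the following two-byte layout,
ignoring details such as 32 kB pages, bit-field portability, and LP_DEAD items:

#include <stdint.h>

typedef struct SlimItemIdData
{
	uint16_t	lp_off:14;		/* byte offset of the tuple within the page */
	uint16_t	lp_flags:2;		/* LP_UNUSED / LP_NORMAL / LP_REDIRECT / LP_DEAD */
} SlimItemIdData;				/* 2 bytes, vs. 4 for today's ItemIdData */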

- Andres




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 09:17:40 -0400, Robert Haas wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

It's not hard to hit scenarios where pages are effectively unusable, because
they have close to 291 dead items, without autovacuum triggering (or
autovacuum just taking a while). You basically just need updates / deletes to
concentrate in a certain range of the table and have indexing that prevents
HOT updates. Because the overall percentage of dead tuples is low, no
autovacuum is triggered, yet a range of the table contains little but dead
items.  At which point you basically waste 7k bytes (1164 bytes for dead items
IIRC) until a vacuum finally kicks in - way more than what you'd waste if
the number of line items were limited at e.g. 2 x MaxHeapTuplesPerPage.
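
(For readers following along: those numbers can be sanity-checked with a small
back-of-the-envelope program, assuming the default 8 kB BLCKSZ and 64-bit
MAXALIGN.  This is a standalone sketch; the real definition of
MaxHeapTuplesPerPage lives in access/htup_details.h.)

#include <stdio.h>

int
main(void)
{
	const int	blcksz = 8192;		/* default BLCKSZ */
	const int	page_header = 24;	/* SizeOfPageHeaderData */
	const int	tuple_header = 24;	/* MAXALIGN(SizeofHeapTupleHeader), 64-bit */
	const int	item_id = 4;		/* sizeof(ItemIdData) */

	int			max_tuples = (blcksz - page_header) / (tuple_header + item_id);

	printf("MaxHeapTuplesPerPage = %d\n", max_tuples);					/* 291 */
	printf("line pointer array   = %d bytes\n", max_tuples * item_id);	/* 1164 */
	printf("rest of the page     = %d bytes\n",
		   blcksz - page_header - max_tuples * item_id);				/* ~7004 */
	return 0;
}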

This has become a bit more pronounced with vacuum skipping index cleanup when
there's "just a few" dead items - if all your updates concentrate in a small
region, 2% of the whole relation size isn't actually that small.


I wonder if we could reduce the real-world space wastage of the line pointer
array, if we changed the logic about which OffsetNumbers to use during
inserts / updates and made a few tweaks to pruning.

1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
   reclaim them during pruning once they're dead. They don't leave behind a
   dead item that's unreclaimable until the next vacuum with an index cleanup
   pass.

2) Arguably the OffsetNumber of a redirect target can be changed. It might
   break careless uses of WHERE ctid = ... though (which likely are already
   broken, just harder to hit).

These leads me to a few potential improvements:

a) heap_page_prune_prune() should take the number of used items into account
   when deciding whether to prune. Right now we trigger hot pruning based on
   the number of items only if PageGetMaxOffsetNumber(page) >=
   MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
   used for a root tuple, we should trigger HOT pruning when it might lower
   which OffsetNumber get used.

b) heap_page_prune_prune() should be triggered in more paths. E.g. when
   inserting / updating, we should prune if it allows us to avoid using a high
   OffsetNumber.

c) What if we left some percentage of ItemIds unused, when looking for the
   OffsetNumber of a new HOT row version? That'd make it more likely for
   non-HOT updates and inserts to fit onto the page, without permanently
   increasing the size of the line pointer array.

d) If we think 2) is acceptable, we could move the targets of redirects to
   make space for new root tuples, without increasing the permanent size of
   the line pointer array.

Crazy?

Greetings,

Andres Freund




Re: Pre-allocating WAL files

2022-04-08 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 04:12:12PM -0700, Nathan Bossart wrote:
> It seems unlikely that this will be committed for v15, so I've adjusted the
> commitfest entry to v16 and moved it to the next commitfest.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 3781795f9b4e448df6bdd24d5cd7c0743b5e2944 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 10 Nov 2021 18:35:14 +
Subject: [PATCH v8 1/2] Move WAL segment creation logic to its own function.

---
 src/backend/access/transam/xlog.c | 103 +--
 src/backend/storage/file/fd.c | 114 ++
 src/include/storage/fd.h  |   1 +
 3 files changed, 116 insertions(+), 102 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7814d4019..87d71e2008 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,11 +2918,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
-	int			save_errno;
 
 	Assert(logtli != 0);
 
@@ -2952,106 +2950,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	elog(DEBUG2, "creating and filling new WAL file");
 
 	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
-
-	unlink(tmppath);
-
-	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m", tmppath)));
-
-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-	save_errno = 0;
-	if (wal_init_zero)
-	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
-
-		/*
-		 * Zero-fill the file.  With this setting, we do this the hard way to
-		 * ensure that all the file space has really been allocated.  On
-		 * platforms that allow "holes" in files, just seeking to the end
-		 * doesn't allocate intermediate space.  This way, we know that we
-		 * have all the space and (after the fsync below) that all the
-		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
-		 * O_DSYNC will be sufficient to sync future writes to the log file.
-		 */
-
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-save_errno = errno;
-break;
-			}
-
-			i += iovcnt;
-		}
-	}
-	else
-	{
-		/*
-		 * Otherwise, seeking to the end and writing a solitary byte is
-		 * enough.
-		 */
-		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
-		{
-			/* if write didn't set errno, assume no disk space */
-			save_errno = errno ? errno : ENOSPC;
-		}
-	}
-	pgstat_report_wait_end();
-
-	if (save_errno)
-	{
-		/*
-		 * If we fail to make the file, delete it to release disk space
-		 */
-		unlink(tmppath);
-
-		close(fd);
-
-		errno = save_errno;
-
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not write to file \"%s\": %m", tmppath)));
-	}
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
-	{
-		int			save_errno = errno;
-
-		close(fd);
-		errno = save_errno;
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m", tmppath)));
-	}
-	pgstat_report_wait_end();
-
-	if (close(fd) != 0)
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m", tmppath)));
+	CreateEmptyWalSegment(tmppath);
 
 	/*
 	 * Now move the segment into place with its final name.  Cope with
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..4efc46460e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3891,3 +3891,117 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 
 	return sum;
 }
+
+/*
+ * CreateEmptyWalSegment
+ *
+ * Create a new file that can be used as a new WAL segment.  The caller is
+ * responsible for installing the new file in pg_wal.
+ */
+void
+CreateEmptyWalSegment(const char *path)
+{
+	PGAlignedXLogBlock zbuffer;
+	int			fd;
+	int			save_errno;
+
+	unlink(path);
+
+	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
+	fd = BasicOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not create 

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 3:31 PM Peter Geoghegan  wrote:
> What if we miss the opportunity to systematically keep successor
> versions of a given logical row on the same heap page over time, due
> only to the current low MaxHeapLinePointersPerPage limit of 291? If we
> had only been able to "absorb" just a few extra versions in the short
> term, we would have had stability (in the sense of being able to
> preserve locality among related logical rows) in the long term. We
> could have kept everything together, if only we didn't overreact to
> what were actually short term, rare perturbations.

Hmm. I wonder if we could teach the system to figure out which of
those things is happening. In the case that I'm worried about, when
we're considering growing the line pointer array, either the line
pointers will be dead or the line pointers will be used but the tuples
to which they point will be dead. In the case you describe here, there
should be very few dead tuples or line pointers in the page. Maybe
when the number of line pointers starts to get big, we refuse to add
more without checking the number of dead tuples and dead line pointers
and verifying that those numbers are still small. Or, uh, something.

One fly in the ointment is that if we refuse to expand the line
pointer array, we might extend the relation instead, which is another
kind of bloat and thus not great. But if the line pointer array is
somehow filling up with tons of dead tuples, we're going to have to
extend the relation anyway. I suspect that in some circumstances it's
better to just accept that outcome and hope that it leads to some
pages becoming empty, thus allowing their line pointer arrays to be
reset.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Frontend error logging style

2022-04-08 Thread Tom Lane
I wrote:
> One other loose end is bothering me: I stuck with logging.h's
> original choice to put "if (likely())" or "if (unlikely())"
> conditionals into the macros, but I rather suspect that that's
> just a waste.  I think we should put a centralized level check
> into logging.c, and get rid of at least the "if (likely())"
> checks, because those are going to succeed approximately 100.0%
> of the time.  Maybe there's an argument for keeping the unlikely()
> ones.

Concretely, something like the attached.  As a simple check,
I looked at the compiled size of pg_dump.  It went from

   text    data     bss     dec     hex filename
 380298    4008    1384  385690   5e29a /home/postgres/testversion/bin/pg_dump

to

   text    data     bss     dec     hex filename
 374954    4008    1384  380346   5cdba src/bin/pg_dump/pg_dump

for a savings of about 5K or 1.5%.  Not a huge amount, but
not nothing either, especially considering that the existing
coding isn't buying us anything.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_utils.h b/src/bin/pg_dump/pg_backup_utils.h
index 5b1c51554d..8173bb93cf 100644
--- a/src/bin/pg_dump/pg_backup_utils.h
+++ b/src/bin/pg_dump/pg_backup_utils.h
@@ -34,8 +34,7 @@ extern void exit_nicely(int code) pg_attribute_noreturn();
 /* In pg_dump, we modify pg_fatal to call exit_nicely instead of exit */
 #undef pg_fatal
 #define pg_fatal(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 		exit_nicely(1); \
 	} while(0)
 
diff --git a/src/common/logging.c b/src/common/logging.c
index 18d6669f27..8a061f46b4 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -223,6 +223,10 @@ pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
 	Assert(fmt);
 	Assert(fmt[strlen(fmt) - 1] != '\n');
 
+	/* Do nothing if log level is too low. */
+	if (level < __pg_log_level)
+		return;
+
 	/*
 	 * Flush stdout before output to stderr, to ensure sync even when stdout
 	 * is buffered.
diff --git a/src/include/common/logging.h b/src/include/common/logging.h
index e213bb70d0..35c7c7b976 100644
--- a/src/include/common/logging.h
+++ b/src/include/common/logging.h
@@ -104,48 +104,39 @@ void		pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
  * pg_log_generic[_v] directly, except perhaps in error interface code.
  */
 #define pg_log_error(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_error_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_error_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_debug(...) do { \
@@ -167,8 +158,7 @@ void		pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
  * A common shortcut: pg_log_error() and immediately exit(1).
  */
 #define pg_fatal(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 		exit(1); \
 	} while(0)
 


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-08 Thread Andrei Zubkov
Hi,

I've rebased this patch so that it can be applied after 57d6aea00fc.

v14 attached
--
regards, Andrei
From 6c541f3001d952e72e5d865fde09de3fb4f36d10 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Fri, 8 Apr 2022 23:12:55 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds stats_since and minmax_stats_since columns to the
pg_stat_statements view and pg_stat_statements() function. The new min/max reset
mode for the pg_stat_statements_reset() function is controlled by the
parameter minmax_only.
stat_since column is populated with the current timestamp when a new statement
is added to the pg_stat_statements hashtable. It provides clean information
about statistics collection time interval for each statement. Besides it can be
used by sampling solutions to detect situations when a statement was evicted and
stored again between samples.
Such sampling solution could derive any pg_stat_statements statistic values for
an interval between two samples with the exception of all min/max statistics. To
address this issue this patch adds the ability to reset min/max statistics
independently of the statement reset using the new minmax_only parameter
of the pg_stat_statements_reset(userid oid, dbid oid, queryid bigint,
minmax_only boolean) function. Timestamp of such reset is stored in the
minmax_stats_since field for each statement.
pg_stat_statements_reset() function now returns the timestamp of a reset as a
result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 .../expected/oldextversions.out   | 118 +++---
 .../expected/pg_stat_statements.out   | 361 +-
 .../pg_stat_statements--1.9--1.10.sql |  16 +-
 .../pg_stat_statements/pg_stat_statements.c   | 128 +--
 .../pg_stat_statements/sql/oldextversions.sql |   7 +-
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  66 +++-
 7 files changed, 637 insertions(+), 208 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index efb2049ecff..0634d73bc03 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -138,7 +138,7 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
 
 -- New function pg_stat_statement_info, and new function
 -- and view for pg_stat_statements introduced in 1.9
-AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+ALTER EXTENSION pg_stat_statements UPDATE TO '1.9';
 SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
pg_get_functiondef
 -
@@ -194,55 +194,79 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
  t
 (1 row)
 
+\d pg_stat_statements_info
+  View "public.pg_stat_statements_info"
+   Column|   Type   | Collation | Nullable | Default 
+-+--+---+--+-
+ dealloc | bigint   |   |  | 
+ stats_reset | timestamp with time zone |   |  | 
+
+SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
+   pg_get_functiondef   
+
+ CREATE OR REPLACE FUNCTION public.pg_stat_statements_reset(userid oid DEFAULT 0, dbid oid DEFAULT 0, queryid bigint DEFAULT 0)+
+  RETURNS void +
+  LANGUAGE c   +
+  PARALLEL SAFE STRICT +
+ AS '$libdir/pg_stat_statements', $function$pg_stat_statements_reset_1_7$function$ +
+ 
+(1 row)
+
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ERROR:  permission denied for function pg_stat_statements_reset
+RESET SESSION AUTHORIZATION;
 -- New functions and views for pg_stat_statements in 1.10
 AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
 \d pg_stat_statements
-  View "public.pg_stat_statements"
- Column |   Type   | Collation | Nullable | Default 
-+--+---+--+-
- userid | oid  |   

Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-04-08 Thread Nathan Bossart
On Wed, Mar 30, 2022 at 09:21:30AM -0700, Nathan Bossart wrote:
> Here is an updated patch set.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a92ccf47c9c334eea5b8d07b8fcab7031181c37e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Feb 2022 09:40:53 -0800
Subject: [PATCH v12 1/2] make more use of get_dirent_type()

---
 src/backend/access/heap/rewriteheap.c |  4 +-
 .../replication/logical/reorderbuffer.c   | 12 +++---
 src/backend/replication/logical/snapbuild.c   |  5 +--
 src/backend/replication/slot.c|  4 +-
 src/backend/storage/file/copydir.c| 21 +++---
 src/backend/storage/file/fd.c | 20 +
 src/backend/utils/misc/guc-file.l | 42 +++
 src/timezone/pgtz.c   |  8 +---
 8 files changed, 48 insertions(+), 68 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..d64d7aae2e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -113,6 +113,7 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1213,7 +1214,6 @@ CheckPointLogicalRewriteHeap(void)
 	mappings_dir = AllocateDir("pg_logical/mappings");
 	while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
 	{
-		struct stat statbuf;
 		Oid			dboid;
 		Oid			relid;
 		XLogRecPtr	lsn;
@@ -1227,7 +1227,7 @@ CheckPointLogicalRewriteHeap(void)
 			continue;
 
 		snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
-		if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+		if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
 			continue;
 
 		/* Skip over files that cannot be ours. */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5adc016d44..63ef55f3f7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -91,6 +91,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -4408,15 +4409,10 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
 {
 	DIR		   *spill_dir;
 	struct dirent *spill_de;
-	struct stat statbuf;
 	char		path[MAXPGPATH * 2 + 12];
 
 	sprintf(path, "pg_replslot/%s", slotname);
 
-	/* we're only handling directories here, skip if it's not ours */
-	if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
-		return;
-
 	spill_dir = AllocateDir(path);
 	while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
 	{
@@ -4464,6 +4460,7 @@ StartupReorderBuffer(void)
 {
 	DIR		   *logical_dir;
 	struct dirent *logical_de;
+	char		path[MAXPGPATH * 2 + 12];
 
 	logical_dir = AllocateDir("pg_replslot");
 	while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
@@ -4476,6 +4473,11 @@ StartupReorderBuffer(void)
 		if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
 			continue;
 
+		/* we're only handling directories here, skip if it's not ours */
+		sprintf(path, "pg_replslot/%s", logical_de->d_name);
+		if (get_dirent_type(path, logical_de, false, ERROR) != PGFILETYPE_DIR)
+			continue;
+
 		/*
 		 * ok, has to be a surviving logical slot, iterate and delete
 		 * everything starting with xid-*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 83fca8a77d..df50abfd98 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -123,6 +123,7 @@
 #include "access/heapam_xlog.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "replication/logical.h"
@@ -1947,15 +1948,13 @@ CheckPointSnapBuild(void)
 		uint32		hi;
 		uint32		lo;
 		XLogRecPtr	lsn;
-		struct stat statbuf;
 
 		if (strcmp(snap_de->d_name, ".") == 0 ||
 			strcmp(snap_de->d_name, "..") == 0)
 			continue;
 
 		snprintf(path, sizeof(path), "pg_logical/snapshots/%s", snap_de->d_name);
-
-		if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+		if (get_dirent_type(path, snap_de, false, ERROR) != PGFILETYPE_REG)
 		{
 			elog(DEBUG1, "only regular files expected: %s", path);
 			continue;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c35ea7c35b..5ee68e71b8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,6 +41,7 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1485,7 +1486,6 @@ StartupReplicationSlots(void)
 	

Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd changing to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I've attached a patch that simply removes durable_rename_excl() and
replaces existing calls with durable_rename().  I noticed that Andres
expressed similar misgivings about durable_rename_excl() last year [0] [1].
I can create a stat-then-durable-rename version of this for back-patching
if that is still the route we want to go.

[0] https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de
[1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d3c633e19555dc0cf98207ad5e7c08ab9ce85dc0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Apr 2022 11:48:17 -0700
Subject: [PATCH v2 1/1] Remove durable_rename_excl().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive uses it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change removes durable_rename_excl() and replaces all existing
calls with durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.

Author: Nathan Bossart
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/20220407182954.GA1231544%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 ++-
 src/backend/access/transam/timeline.c | 14 +-
 src/backend/access/transam/xlog.c |  8 +---
 src/backend/storage/file/fd.c | 63 ---
 src/include/pg_config_manual.h|  7 ---
 src/include/storage/fd.h  |  1 -
 6 files changed, 7 insertions(+), 91 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..128f754e87 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +514,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 12:04 PM Robert Haas  wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes. So now you could have ~25% of each page in the table storing
> dead line pointers. That sounds awful, and just running VACUUM won't
> fix it once it's happened, because the still-live line pointers are
> likely to be at the end of the line pointer array and thus truncating
> it won't necessarily be possible.

I see. That's a legitimate concern, though one that I believe can be
addressed. I have learned to dread any kind of bloat that's
irreversible, no matter how minor it might seem when seen as an
isolated event, so I'm certainly sympathetic to these concerns. You
can make a similar argument in favor of a higher
MaxHeapLinePointersPerPage limit, though -- and that's why I believe
an increase of some kind makes sense. The argument goes like this:

What if we miss the opportunity to systematically keep successor
versions of a given logical row on the same heap page over time, due
only to the current low MaxHeapLinePointersPerPage limit of 291? If we
had only been able to "absorb" just a few extra versions in the short
term, we would have had stability (in the sense of being able to
preserve locality among related logical rows) in the long term. We
could have kept everything together, if only we didn't overreact to
what were actually short term, rare perturbations.

-- 
Peter Geoghegan




Re: Frontend error logging style

2022-04-08 Thread Tom Lane
Daniel Gustafsson  writes:
> On 30 Mar 2022, at 00:38, Tom Lane  wrote:
>> Feel free to work on a followup editing patch though.

> That's my plan, once this lands I'll rebase the comments on top of your work 
> and
> we can have a separate discussion around them then.

The main patch is pushed now.  I addressed the complaint Peter had
about the messages with "Check your installation" pseudo-hints
by getting rid of them; I concur with your observation that those
hints were basically useless.  I also fixed the one place where the
message should clearly be "could not close" not "could not write".
Mostly didn't yield to temptation anywhere else.

One other loose end is bothering me: I stuck with logging.h's
original choice to put "if (likely())" or "if (unlikely())"
conditionals into the macros, but I rather suspect that that's
just a waste.  I think we should put a centralized level check
into logging.c, and get rid of at least the "if (likely())"
checks, because those are going to succeed approximately 100.0%
of the time.  Maybe there's an argument for keeping the unlikely()
ones.

regards, tom lane
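
For illustration, a minimal sketch of that centralized check, assuming it
lands at the top of pg_log_generic_v() in src/common/logging.c (placement
and the use of unlikely() here are only suggestions, not committed code);
the macro side would then drop its conditionals as in the diff earlier in
this digest:

void
pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
				 const char *pg_restrict fmt, va_list ap)
{
	/* Single, centralized level filter; the macros no longer test it. */
	if (unlikely(__pg_log_level > level))
		return;

	/* ... existing formatting and output code, unchanged ... */
}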




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 12:57 PM Peter Geoghegan  wrote:
> What do you mean about wasting space? Wasting space on the stack? I
> can't imagine you meant wasting space on the page, since being able to
> accomodate more items on each heap page seems like it would be
> strictly better, barring any unintended weird FSM issues.

I meant wasting space in the page. I think that's a real concern.
Imagine you allowed 1000 line pointers per page. Each one consumes 2
bytes. So now you could have ~25% of each page in the table storing
dead line pointers. That sounds awful, and just running VACUUM won't
fix it once it's happened, because the still-live line pointers are
likely to be at the end of the line pointer array and thus truncating
it won't necessarily be possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 9:44 AM Peter Geoghegan  wrote:
> On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
>  wrote:
> > Yeah, I think we should definately support more line pointers on a
> > heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> > the current value is the physical limit for heap tuples, as we have at
> > most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage
> > won't change. A macro MaxHeapLinePointersPerPage would probably be
> > more useful, which could be as follows (assuming we don't want to
> > allow filling a page with effectively only dead line pointers):
>
> That's a good point. Sounds like it might be the right approach.
>
> I suppose that it will depend on how much use of MaxHeapTuplesPerPage
> remains once it is split in two like this.

Thinking about this some more, I wonder if it would make sense to
split MaxHeapTuplesPerPage into two new constants (a true
MaxHeapTuplesPerPage, plus MaxHeapLinePointersPerPage), for the
reasons discussed, but also as a way of getting a *smaller* effective
MaxHeapTuplesPerPage than 291 in some contexts only.

There are some ways in which the current MaxHeapTuplesPerPage isn't
enough, but also some ways in which it is excessive. It might be
useful if PageGetHeapFreeSpace() usually considered a heap page to
have no free space if the number of tuples with storage (or some cheap
proxy thereof) was about 227, which is the largest number of distinct
heap tuples that can *plausibly* ever be stored on an 8KiB page (it
ignores zero column tables). Most current PageGetHeapFreeSpace()
callers (including VACUUM) would continue to call that function in the
same way as today, and get this lower limit.

A few of the existing PageGetHeapFreeSpace() callers could store more
line pointers than that (MaxHeapLinePointersPerPage, which might be
510 in practice) -- but only those involved in updates. The overall
idea is to recognize that free space is not interchangeable -- updates
should have some kind of advantage over plain inserts when it comes to
the space on the page of the tuple that they're updating.

We might even want to make our newly defined, lower
MaxHeapTuplesPerPage into a tunable storage param.

-- 
Peter Geoghegan
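
To make that split concrete, a rough sketch of the idea (the function name
and limits are hypothetical, and the real PageGetHeapFreeSpace() in
bufpage.c additionally rechecks for recyclable line pointers once the
ceiling is reached):

#include "storage/bufpage.h"

/*
 * Hypothetical variant: callers pass the ceiling they care about, e.g.
 * roughly 227 for plain inserts and the larger line-pointer ceiling
 * (MaxHeapLinePointersPerPage) for update paths that want to keep
 * successor versions on the same page.
 */
static Size
heap_free_space_with_limit(Page page, OffsetNumber limit)
{
	Size		space = PageGetFreeSpace(page);

	/* Report the page as full once it already carries 'limit' line pointers. */
	if (space > 0 && PageGetMaxOffsetNumber(page) >= limit)
		space = 0;

	return space;
}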




Re: trigger example for plsample

2022-04-08 Thread Mark Wong
On Thu, Apr 07, 2022 at 06:29:53PM -0400, Tom Lane wrote:
> Chapman Flack  writes:
> > v4 looks good to me.
> 
> Pushed with very minor editorialization.  Mainly, I undid the
> decision to stop printing the function source text, on the
> grounds that (1) it falsified the comment immediately above,
> and (2) if you have to print it anyway to avoid compiler warnings,
> you're just creating confusing inconsistency between the two
> handler functions.

Sounds good to me, thanks!

Regards,
Mark




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote:
> On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
>> I'd actually be in favor of nuking durable_rename_excl() from orbit
>> and putting the file-exists tests in the callers. Otherwise, someone
>> might assume that it actually has the semantics that its name
>> suggests, which could be pretty disastrous. If we don't want to do
>> that, then I'd changing to do the stat-then-durable-rename thing
>> internally, so we don't leave hard links lying around in *any* code
>> path. Perhaps that's the right answer for the back-branches in any
>> case, since there could be third-party code calling this function.
> 
> I think there might be another problem.  The man page for rename() seems to
> indicate that overwriting an existing file also introduces a window where
> the old and new path are hard links to the same file.  This isn't a problem
> for the WAL files because we should never be overwriting an existing one,
> but I wonder if it's a problem for other code paths.  My guess is that many
> code paths that overwrite an existing file are first writing changes to a
> temporary file before atomically replacing the original.  Those paths are
> likely okay, too, as you can usually just discard any existing temporary
> files.

Ha, so there are only a few callers of durable_rename_excl() in the
PostgreSQL tree.  One is basic_archive.c, which is already doing a stat()
check.  IIRC I only used durable_rename_excl() here to handle the case
where multiple servers are writing archives to the same location.  If that
happened, the archiver process would begin failing.  If a crash left two
hard links to the same file around, we will silently succeed the next time
around thanks to the compare_files() check.  Besides the WAL installation
code, the only other callers are in timeline.c, and both note that the use
of durable_rename_excl() is for "paranoidly trying to avoid overwriting an
existing file (there shouldn't be one)."

So AFAICT basic_archive.c is the only caller with a strong reason for using
durable_rename_excl(), and even that might not be worth keeping it around.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Etsuro Fujita
Hi,

On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby  wrote:
> This patch seems to be causing the planner to crash.
> Here's a query reduced from sqlsmith.
>
> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= 
> pg_trigger_depth();
>
> Program terminated with signal SIGSEGV, Segmentation fault.

Reproduced.  Will look into this.

Thanks for the report!

Best regards,
Etsuro Fujita




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 6:17 AM Robert Haas  wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

What do you mean about wasting space? Wasting space on the stack? I
can't imagine you meant wasting space on the page, since being able to
accommodate more items on each heap page seems like it would be
strictly better, barring any unintended weird FSM issues.

As far as I know the only real downside to increasing it is the impact
on tidbitmap.c. Increasing the number of valid distinct TID values
might have a negative impact on performance during bitmap scans, which
will need to be managed. However, I don't think that increased stack
space usage will be a problem, with a little work. It either won't
matter at all (e.g. an array of offset numbers on the stack still
won't be very big), or it can be fixed locally where it turns out to
matter (like in lazy_scan_prune).

We used to routinely use MaxOffsetNumber for arrays of item offset
numbers. I cut down on that in the B-Tree code, reducing it to
MaxIndexTuplesPerPage (which is typically 407) in a few places. So
anything close to our current MaxIndexTuplesPerPage ought to be fine
for most individual arrays stored on the stack.

-- 
Peter Geoghegan




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I see that durable_rename_excl() has the following comment: "Similar
> to durable_rename(), except that this routine tries (but does not
> guarantee) not to overwrite the target file." If those are the desired
> semantics, we could achieve them more simply and more safely by just
> trying to stat() the target file and then, if it's not found, call
> durable_rename(). I think that would be a heck of a lot safer than
> what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
all.  Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().

> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd changing to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I think there might be another problem.  The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file.  This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths.  My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original.  Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

> Your proposed fix is OK if we don't want to do any of that stuff, but
> personally I'm much more inclined to blame durable_rename_excl() for
> being horrible than I am to blame the calling code for using it
> improvidently.

I do agree that it's worth examining this stuff a bit closer.  I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
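
As a sketch of the stat-then-durable-rename shape Robert describes above,
roughly as it might look inside fd.c (the function name is made up here,
and a real version would want to distinguish ENOENT from other stat()
failures):

#include <sys/stat.h>

static int
durable_rename_if_absent(const char *oldfile, const char *newfile, int elevel)
{
	struct stat st;

	if (stat(newfile, &st) == 0)
	{
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("file \"%s\" already exists", newfile)));
		return -1;
	}

	/* Plain rename path: no link()/unlink(), so no leftover hard links. */
	return durable_rename(oldfile, newfile, elevel);
}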




Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-08 Thread SATYANARAYANA NARLAPURAM
On Fri, Apr 8, 2022 at 6:44 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> >  wrote:
> > >
> > > Hi,
> > >
> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> > > reality, the sync/async standbys can either be closer/farther (which
> > > means sync/async standbys can receive WAL at different times) to
> > > primary, especially in cloud HA environments with primary in one
> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > > $subject may not be possible on dev systems (say, for testing some HA
> > > features) unless we can inject a delay in WAL senders before sending
> > > WAL.
>

Simulation will be helpful even for end customers, to inject faults in
production environments during availability zone/disaster recovery drills.



> > >
> > > How about having two developer-only GUCs {async,
> > > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > > delay sending WAL by {async, sync}_wal_sender_delay
> > > milliseconds/seconds? Although, I can't think of any immediate use, it
> > > will be useful someday IMO, say for features like [1], if it gets in.
> > > With this set of GUCs, one can even add core regression tests for HA
> > > features.
>

I would suggest doing this at the slot level, instead of two GUCs that
control the behavior of all the slots (physical/logical). Something like
"pg_suspend_replication_slot" and "pg_resume_replication_slot"?
Alternatively, a GUC on the standby side instead of the primary, so that the WAL
receiver stops responding to the WAL sender? This helps achieve the same as
above, but the granularity is now at the individual replica level.

Thanks,
Satya
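
For what it's worth, the GUC-based proposal quoted above would amount to
something like the following in the WAL sender's send path. The GUC names
and the sync/async distinction are hypothetical; in practice, whether a
given walsender is currently synchronous would still have to be determined
from the synchronous-standby configuration:

/* Hypothetical developer-only GUCs, in milliseconds. */
int			sync_wal_sender_delay = 0;
int			async_wal_sender_delay = 0;

static void
maybe_delay_wal_send(bool is_sync_standby)
{
	int			delay_ms = is_sync_standby ?
		sync_wal_sender_delay : async_wal_sender_delay;

	if (delay_ms > 0)
		pg_usleep(delay_ms * 1000L);	/* milliseconds to microseconds */
}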


Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
 wrote:
> Yeah, I think we should definitely support more line pointers on a
> heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> the current value is the physical limit for heap tuples, as we have at
> most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage
> won't change. A macro MaxHeapLinePointersPerPage would probably be
> more useful, which could be as follows (assuming we don't want to
> allow filling a page with effectively only dead line pointers):

That's a good point. Sounds like it might be the right approach.

I suppose that it will depend on how much use of MaxHeapTuplesPerPage
remains once it is split in two like this.

-- 
Peter Geoghegan
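
The macro Matthias refers to above is not reproduced in this excerpt;
purely as a sketch of the idea (not necessarily his exact definition),
sized so that each smallest possible live tuple can leave behind roughly
one extra dead line pointer, which works out to about 510 on an 8KiB page
versus 291 for MaxHeapTuplesPerPage:

#define MaxHeapLinePointersPerPage \
	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
			(MAXALIGN(SizeofHeapTupleHeader) + 2 * sizeof(ItemIdData)) * 2))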




Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-08 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.

Not cosmic rays: skink has shown the same symptom three times running.
Looks like maybe the archive_cleanup_command itself is doing something
it shouldn't?

regards, tom lane




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Zhihong Yu
On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby  wrote:

> On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> > I have committed the patch after modifying it as such.  (I think we
> > can improve these later, if necessary.)
>
> This patch seems to be causing the planner to crash.
> Here's a query reduced from sqlsmith.
>
> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1
> <= pg_trigger_depth();
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
> ../../../../src/include/nodes/pg_list.h:151
> 151 return l ? l->length : 0;
> (gdb) bt
> #0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
> ../../../../src/include/nodes/pg_list.h:151
> #1  0x55b43968af89 in mark_async_capable_plan 
> (plan=plan@entry=0x7f4219ed93b0,
> path=path@entry=0x7f4219e89538) at createplan.c:1132
> #2  0x55b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
> createplan.c:1329
> #3  0x55b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
> createplan.c:421
> #4  0x55b43968f974 in create_projection_plan 
> (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0f60, flags=flags@entry=1) at
> createplan.c:2039
> #5  0x55b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0,
> best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
> #6  0x55b439690221 in create_plan (root=root@entry=0x55b43affb2b0,
> best_path=) at createplan.c:348
> #7  0x55b4396a1451 in standard_planner (parse=0x55b43af05e28,
> query_string=, cursorOptions=2048, boundParams=0x0) at
> planner.c:413
> #8  0x55b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at planner.c:277
> #9  0x55b439790c78 in pg_plan_query 
> (querytree=querytree@entry=0x55b43af05e28,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at postgres.c:883
> #10 0x55b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at postgres.c:975
> #11 0x55b439791239 in exec_simple_query
> (query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();")
> at postgres.c:1169
> #12 0x55b439793183 in PostgresMain (dbname=,
> username=) at postgres.c:4542
> #13 0x55b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at
> postmaster.c:4489
> #14 0x55b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0)
> at postmaster.c:4217
> #15 0x55b4396e9e4a in ServerLoop () at postmaster.c:1791
> #16 0x55b4396eb401 in PostmasterMain (argc=7, argv=) at
> postmaster.c:1463
> #17 0x55b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202
>
> Actually, the original query failed like this:
> #2  0x55b4398e9f90 in ExceptionalCondition
> (conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus ==
> SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b
> "FailedAssertion",
> #3  0x55b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at
> setrefs.c:1367
>

Hi,
I logged the value of plan->scanstatus before the assertion:

2022-04-08 16:20:59.601 UTC [26325] LOG:  scan status 0
2022-04-08 16:20:59.601 UTC [26325] STATEMENT:  explain SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
2022-04-08 16:20:59.796 UTC [26296] LOG:  server process (PID 26325) was
terminated by signal 11: Segmentation fault

It seems its value was SUBQUERY_SCAN_UNKNOWN.

Still trying to find out the cause for the crash.


Re: Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

2022-04-08 Thread Andres Freund
Hi, 

On April 8, 2022 7:52:07 AM PDT, Matthias van de Meent 
 wrote:
>On Sat, 19 Mar 2022 at 01:15, Andres Freund  wrote:
>> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>>
>> pg_rewrite is the second biggest relation in an empty database already...
>
>Yeah, that's not great. Thanks for nerd-sniping me into looking into
>how views and pg_rewrite rules work, that was very interesting and I
>learned quite a lot.

Thanks for looking!


># Immediately potential, limited to progress views
>
>I noticed that the CASE-WHEN (used in translating progress stage index
>to stage names) in those progress reporting views can be more
>efficiently described (althoug with slightly worse behaviour around
>undefined values) using text array lookups (as attached). That
>resulted in somewhat smaller rewrite entries for the progress views
>(toast compression was good old pglz):
>
>template1=# SELECT sum(octet_length(ev_action)),
>SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
>ev_class::regclass::text LIKE '%progress%';
>
>master:
>  sum  |  sum
>---+---
> 97277 | 19956
>patched:
>  sum  |  sum
>---+---
> 77069 | 18417
>
>So this seems like a nice improvement of 20% uncompressed / 7% compressed.
>
>I tested various cases of phase number to text translations: `CASE ..
>WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
>results below:
>
>postgres=# create or replace view arrayliteral_view as select
>(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
>s(index);
>CREATE INDEX
>postgres=# create or replace view stringcast_view as select
>('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
>CREATE INDEX
>postgres=# create or replace view split_stringcast_view as select
>(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
>CREATE VIEW
>postgres=# create or replace view case_view as select case index when
>0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
>'e' when 5 then 'f' end as name from tst s(index);
>CREATE INDEX
>
>
>postgres=# postgres=# select ev_class::regclass::text,
>octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
>where ev_class in ('arrayliteral_view'::regclass::oid,
>'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
>'stringcast_view'::regclass::oid);
>   ev_class| octet_length | pg_column_size
>---+--+
> arrayliteral_view | 3311 |   1322
> stringcast_view   | 2610 |   1257
> case_view | 5170 |   1412
> split_stringcast_view | 2847 |   1350
>
>It seems to me that we could consider replacing the CASE statements
>with array literals and lookups if we really value our template
>database size. But, as text literal concatenations don't seem to get
>constant folded before storing them in the rules table, this rewrite
>of the views would result in long lines in the system_views.sql file,
>or we'd have to deal with the additional overhead of the append
>operator and cast nodes.

My inclination is that the mapping functions should be C functions. There's 
really no point in doing it in SQL and it comes at a noticeable price. And, if 
done in C, we can fix mistakes in minor releases, which we can't in SQL.


># Future work; nodeToString / readNode, all rewrite rules
>
>Additionally, we might want to consider other changes like default (or
>empty value) elision in nodeToString, if that is considered a
>reasonable option and if we really want to reduce the size of the
>pg_rewrite table.
>
>I think a lot of space can be recovered from that: A manual removal of
>what seemed to be fields with default values (and the removal of all
>query location related fields) in the current definition of
>pg_stat_progress_create_index reduces its uncompressed size from
>23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
>for an on-disk space saving of 33% for this view's ev_action.
>
>Do note, however, that that would add significant branching in the
>nodeToString and readNode code, which might slow down that code
>significantly. I'm not planning on working on that; but in my opinion
>that is a viable path to reducing the size of new database catalogs.

We should definitely be careful about that. I do agree that there's a lot of 
efficiency to be gained in the serialization format. Once we have the automatic 
node func generation in place, we could have one representation for human 
consumption, and one for density...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
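
A sketch of what such a C-level mapping function could look like for one of
the progress views. The function name, the phase list, and its SQL-callable
declaration (pg_proc entry or PG_FUNCTION_INFO_V1, depending on where it
lives) are placeholders, not existing code:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

Datum
pg_stat_progress_phase_name(PG_FUNCTION_ARGS)
{
	int32		phase = PG_GETARG_INT32(0);

	/* Placeholder phase names; the real list belongs with the view's C code. */
	static const char *const names[] = {
		"initializing",
		"in progress",
		"finalizing"
	};

	if (phase < 0 || phase >= (int) lengthof(names))
		PG_RETURN_NULL();

	PG_RETURN_TEXT_P(cstring_to_text(names[phase]));
}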




Re: remove more archiving overhead

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:20:27AM -0400, Robert Haas wrote:
> On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart  
> wrote:
>> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> > Yes.  I found that a crash at an unfortunate moment can produce multiple
>> > links to the same file in pg_wal, which seemed bad independent of archival.
>> > By fixing that (i.e., switching from durable_rename_excl() to
>> > durable_rename()), we not only avoid this problem, but we also avoid trying
>> > to archive a file the server is concurrently writing.  Then, after a crash,
>> > the WAL file to archive should either not exist (which is handled by the
>> > archiver) or contain the same contents as any preexisting archives.
>>
>> I moved the fix for this to a new thread [0] since I think it should be
>> back-patched.  I've attached a new patch that only contains the part
>> related to reducing archiving overhead.
> 
> While we're now after the feature freeze and thus this will need to
> wait for v16, it looks like a reasonable change to me.

Dang, just missed it.  Thanks for taking a look.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Temporary file access API

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 4:54 AM Antonin Houska  wrote:
> Do you think that the use of a system call is a problem itself (e.g. because
> the code looks less simple if read/write is used somewhere and fread/fwrite
> elsewhere; of course of read/write is mandatory in special cases like WAL,
> heap pages, etc.)  or is the problem that the system calls are used too
> frequently? I suppose only the latter.

I'm not really super-interested in reducing the number of system
calls. It's not a dumb thing in which to be interested and I know that
for example Thomas Munro is very interested in it and has done a bunch
of work in that direction just to improve performance. But for me the
attraction of this is mostly whether it gets us closer to TDE.

And that's why I'm asking these questions about adopting it in
different places. I kind of thought that your previous patches needed
to encrypt, I don't know, 10 or 20 different kinds of files. So I was
surprised to see this patch touching the handling of only 2 kinds of
files. If we consolidate the handling of let's say 15 of 20 cases into
a single mechanism, we've really moved the needle in the right
direction -- but consolidating the handling of 2 of 20 cases, or
whatever the real numbers are, isn't very exciting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector

2022-04-08 Thread Andres Freund
Hi, 

On April 8, 2022 4:49:48 AM PDT, Ranier Vilela  wrote:
>Hi,
>
>Per Coverity.
>
>pgstat_reset_entry does not check if the lock was really acquired.
>I think shared_stat_reset_contents could then be called without the lock;
>is that an issue or not?

I don't think so - the nowait parameter is set to false, so the lock 
acquisition is blocking.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Atomic rename feature for Windows.

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 11:45 AM Greg Stark  wrote:
> But that's useful for some things and not for others. Like, it's
> useful to be sure we don't have odd dependencies on timing quirks of
> the specific machines that are currently common, or depend on gcc/llvm
> compiler behaviour that isn't guaranteed. But less so for supporting
> some quirky filesystem behaviour on Windows 8 that newer Windows
> doesn't have and Unix guarantees not to have. (Or supporting non-IEEE
> Vax FP now that we've decided we just don't any more).

Yeah, exactly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 4:47 AM Markus Wanner
 wrote:
> I agree with Michael, it would be nice to not duplicate the code, but
> use a common underlying method.  A modified patch is attached.

I don't think this is better, but I don't think it's worth arguing
about, either, so I'll do it this way if nobody objects.

Meanwhile, I've committed the patch for master to master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Atomic rename feature for Windows.

2022-04-08 Thread Greg Stark
On Fri, 8 Apr 2022 at 11:30, Robert Haas  wrote:
>
> On Fri, Apr 8, 2022 at 10:12 AM Greg Stark  wrote:
> > On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
> > > I'm not for dropping support for some platform just because it's old.
> >
> > I guess I'll have to spin up the Vax again :)
>
> This is a pretty good summary of what's wrong with our current
> deprecation policy.

I didn't intend it that way but, ok.

> Like Tom, I kind of hate removing support for old
> systems. But I've also come to realize that we often end up supporting
> systems because there's one PostgreSQL developer who has access and
> sets up a buildfarm member ... which tends to mean that we support all
> the stuff that lots of people are using, plus a pretty random subset
> of older systems that do funny things and most people can't access to
> debug any problems that may occur. And that's kind of annoying.

Generally I think supporting older systems that do funny things is
helpful in avoiding problems that either 1) Can happen on newer
systems but rarely 2) Can happen on other systems that people are
using but we don't know about and aren't testing and 3) Can happen on
future systems or future compilers and we might not even find out
about.

But that's useful for some things and not for others. Like, it's
useful to be sure we don't have odd dependencies on timing quirks of
the specific machines that are currently common, or depend on gcc/llvm
compiler behaviour that isn't guaranteed. But less so for supporting
some quirky filesystem behaviour on Windows 8 that newer Windows
doesn't have and Unix guarantees not to have. (Or supporting non-IEEE
Vax FP now that we've decided we just don't any more).

-- 
greg




Re: Commitfest Closed

2022-04-08 Thread Greg Stark
I moved to next CF almost all the Needs Review and Waiting on Author patches.

The remaining ones are either:

1) Bug fixes, Documentation, or testing patches that we may want to
make Open Issues

2) Patches that look like we may want to mark Rejected or Returned
with Feedback and start a new discussion

3) Patches whose email history confused me, such as where multiple
patches are under discussion

I also haven't gone through the Ready for Committer patches yet. I'll
do that at the end of the day.

Incidentally I marked a lot of the Waiting on Author patches as Needs
Review before moving to the next CF because generally I think they
were only Waiting on Author because of the cfbot failures and they
were waiting on design feedback.

Also, as another aside, I find a lot of the patches that haven't been
reviewed were patches that were posted without any specific concerns
or questions. That tends to imply the author thinks the patch is ready
and just waiting on a comprehensive review which is a daunting task.

I would suggest if you're an author posting a WIP and there's some
specific uncertainties that you have about the patch that asking about
them would encourage reviewers to dive in and help you make progress.




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 11:29 AM Stephen Frost  wrote:
> > +  allow_cred_delegation
> >
> > First, I again recommend not choosing words at random to abbreviate.
> > "delegate_credentials" would be shorter and clearer. Second, I think
> > we need to decide whether we envision just having one parameter here
> > for every kind of credential delegation that libpq might ever support,
> > or whether this is really something specific to GSS. If the latter,
> > the name should mention GSS.
>
> delegate_credentials seems to imply that the server has some kind of
> control over the act of delegating credentials, which isn't really the
> case.  The client has to decide to delegate credentials and it does that
> independent of the server- the server side just gets to either accept
> those delegated credentials, or ignore them.

Oh ... I thought this was a libpq parameter to control the client
behavior. I guess I didn't read it carefully enough.

> Regarding the client side, it is the case that GSSAPIDelegateCredentials
> in ssh defaults to no, so it seems like the next iteration of the patch
> should probably include a libpq option similar to that ssh_config
> option.  As I mentioned before, users already can decide if they'd like
> proxyable credentials or not when they kinit, though more generally this
> is set as a environment-wide policy, but we can add an option and
> disable it by default.

+1.

> This isn't actually something we have a choice in, really, it's from the
> Kerberos library.  MEMORY is the library's in-memory credential cache.
> Other possible values are FILE:/some/file, DIR:/some/dir, API:, and
> others.  Documentaton is available here:
> https://web.mit.edu/kerberos/krb5-1.12/doc/basic/ccache_def.html

Well, I was just going by the fact that this string ("MEMORY:") seems
to be being interpreted in our code, not the library.

> > I wonder whether we really quite this many cases. But if we do they
> > probably need better and more consistent naming.
>
> I wouldn't want to end up with values that could end up conflicting with
> real values that a user might want to specify, so the choice of
> 'environment' and empty-value were specifically chosen to avoid that
> risk.  If we're worried that doing so isn't sufficient or is too
> confusing, the better option would likely be to have another GUC that
> controls if we unset, ignore, or set the value to what the other GUC
> says to set it to.  I'm fine with that if you agree.

Yeah, I thought of that, and it might be the way to go. I wasn't too
sure we needed the explicit-unset behavior as an option, but I defer
to you on that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
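
To illustrate the unset/ignore/set behaviour under discussion, roughly what
the server side would do with such a setting. The special values follow the
patch text quoted in this thread, but the function and the GUC plumbing are
only a sketch, not committed code:

#include <stdlib.h>
#include <string.h>

static void
apply_credential_cache_setting(const char *ccache)
{
	if (ccache == NULL || strcmp(ccache, "environment") == 0)
		return;					/* leave KRB5CCNAME untouched */

	if (ccache[0] == '\0')
		unsetenv("KRB5CCNAME");	/* fall back to the library default */
	else
		setenv("KRB5CCNAME", ccache, 1);	/* e.g. "MEMORY:" or "FILE:/..." */
}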




Re: Size of pg_rewrite

2022-04-08 Thread Matthias van de Meent
On Fri, 8 Apr 2022 at 17:20, Dagfinn Ilmari Mannsåker  wrote:
>
> Matthias van de Meent  writes:
>
> > But, as text literal concatenations don't seem to get constant folded
> > before storing them in the rules table, this rewrite of the views
> > would result in long lines in the system_views.sql file, or we'd have
> > to deal with the additional overhead of the append operator and cast
> > nodes.
>
> There is no need to use the concatenation operator to split array
> constants across multiple lines. Newlines are fine either inside the
> string (between array elements), or between two string literals
> (which become one string constant at parse time).

Ah, neat, that saves some long lines in the system_views file. I had
already tried the "auto-concatenate two consecutive string literals",
but that try failed in initdb, so now I'm not sure what happened
there.

Thanks!

-Matthias




Re: Atomic rename feature for Windows.

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 10:12 AM Greg Stark  wrote:
> On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
> > I'm not for dropping support for some platform just because it's old.
>
> I guess I'll have to spin up the Vax again :)

This is a pretty good summary of what's wrong with our current
deprecation policy. Like Tom, I kind of hate removing support for old
systems. But I've also come to realize that we often end up supporting
systems because there's one PostgreSQL developer who has access and
sets up a buildfarm member ... which tends to mean that we support all
the stuff that lots of people are using, plus a pretty random subset
of older systems that do funny things and most people can't access to
debug any problems that may occur. And that's kind of annoying.

(I don't have a specific proposal for what to do about it.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Apr 8, 2022 at 8:21 AM Stephen Frost  wrote:
> > Added an explicit 'environment' option to allow for, basically, existing
> > behavior, where we don't mess with the environment variable at all,
> > though I kept the default as MEMORY since I don't think it's really
> > typical that folks actually want regular user backends to inherit the
> > credential cache of the server.
> >
> > Added a few more tests and updated the documentation too.  Sadly, seems
> > we've missed the deadline for v15 though for lack of feedback on these.
> > Would really like to get some other folks commenting as these are new
> > pg_hba and postgresql.conf options being added.
> 
> I don't think this patch is quite baked enough to go in even if the
> deadline hadn't formally passed, but I'm happy to offer a few opinions
> ... especially if we can also try to sort out a plan for getting that
> wider-checksums thing you mentioned done for v16.

Sure.

>  + /* gssencmode is also libpq option, same to above. */
> + {"gssencmode", UserMappingRelationId, true},
> 
> I really hate names like this that are just a bunch of stuff strung
> together with no punctuation and some arbitrary abbreviations thrown
> in for good measure. But since the libpq parameter already exists it's
> hard to argue we should do anything else here.

Well, yeah.

> +  allow_cred_delegation
> 
> First, I again recommend not choosing words at random to abbreviate.
> "delegate_credentials" would be shorter and clearer. Second, I think
> we need to decide whether we envision just having one parameter here
> for every kind of credential delegation that libpq might ever support,
> or whether this is really something specific to GSS. If the latter,
> the name should mention GSS.

delegate_credentials seems to imply that the server has some kind of
control over the act of delegating credentials, which isn't really the
case.  The client has to decide to delegate credentials and it does that
independent of the server- the server side just gets to either accept
those delegated credentials, or ignore them. 

In terms of having a prefix, this is certainly something that I'd like
to see SSPI support added for as well (perhaps that can be in v16 too)
and so it's definitely not GSS-exclusive among the authentication
methods that we have today.  In that sense, this option falls into the
same category as 'include_realm' and 'krb_realm' in that it applies to
more than one, but not all, of our authentication methods.

> I also suggest that the default value of this option should be false,
> rather than true. I would be unhappy if ssh started defaulting to
> ForwardAgent=yes, because that's less secure and I don't want my
> credentials shared with random servers without me making a choice to
> do that. Similarly here I think we should default to the more secure
> option.

This is a bit backwards from how it works though- this option is about
if the server will accept delegated credentials, not if the client sends
them.  If your client was set to ForwardAgent=yes, would you be happy if
the server's default was AllowAgentForwarding=no?  (At least on the
system I'm looking at, the current default is AllowAgentForwarding=yes
in sshd_config).

Regarding the client side, it is the case that GSSAPIDelegateCredentials
in ssh defaults to no, so it seems like the next iteration of the patch
should probably include a libpq option similar to that ssh_config
option.  As I mentioned before, users already can decide if they'd like
proxyable credentials or not when they kinit, though more generally this
is set as a environment-wide policy, but we can add an option and
disable it by default.

> +  
> +   
> +Sets the location of the Kerberos credential cache to be used for
> +regular user backends which go through authentication.  The default 
> is
> +MEMORY:, which is where delegated credentials
> +are stored (and is otherwise empty).  Care should be used when 
> changing
> +this value- setting it to a file-based credential cache will mean 
> that
> +user backends could potentially use any credentials stored to access
> +other systems.
> +If this parameter is set to an empty string, then the variable will 
> be
> + explicit un-set and the system-dependent default is used, which may be a
> + file-based credential cache with the same caveats as previously
> + mentioned.  If the special value 'environment' is used, then the variable
> + is left untouched and will be whatever was set in the environment at
> + startup time.
> 
> "MEMORY:" seems like a pretty weird choice of arbitrary string. Is it
> supposed to look like a Windows drive letter or pseudo-device, or
> what? I'm not sure exactly what's better here, but I just think this
> doesn't look like anything else we've got today. And then we've got a
> second special environment, "environment", which looks 

Re: Size of pg_rewrite

2022-04-08 Thread Dagfinn Ilmari Mannsåker
Matthias van de Meent  writes:

> But, as text literal concatenations don't seem to get constant folded
> before storing them in the rules table, this rewrite of the views
> would result in long lines in the system_views.sql file, or we'd have
> to deal with the additional overhead of the append operator and cast
> nodes.

There is no need to use the concatenation operator to split array
constants across multiple lines. Newlines are fine either inside the
string (between array elements), or between two string literals
(which become one string constant at parse time).

https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

ilmari@[local]:5432 ~=# select
ilmari@[local]:5432 ~-# '{foo,
ilmari@[local]:5432 ~'# bar}'::text[],
ilmari@[local]:5432 ~-# '{bar,'
ilmari@[local]:5432 ~-# 'baz}'::text[];
┌───┬───┐
│   text│   text│
├───┼───┤
│ {foo,bar} │ {bar,baz} │
└───┴───┘
(1 row)


- ilmari




Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
> On windows that makes prep_buildtree go from 42.4s to 5.8s for me.
>

I applied Andres's faster prep build tree changes and triggered some cirrus
runs

Without these changes, preparing the build tree was taking around 42.3s
(sometimes even more) [1].
It seems like with these changes, it drops to around 8s [2].

[1] https://cirrus-ci.com/task/6562493345562624
[2] https://cirrus-ci.com/task/4836843802853376

Best,
Melih


Re: Commitfest Closed

2022-04-08 Thread Julien Rouhaud
On Fri, Apr 08, 2022 at 08:09:16AM -0700, Peter Geoghegan wrote:
> On Fri, Apr 8, 2022 at 5:58 AM Alvaro Herrera  wrote:
> > Thanks for herding through the CF!
> 
> +1

+1!




Re: [Proposal] vacuumdb --schema only

2022-04-08 Thread Gilles Darold

Le 08/04/2022 à 02:46, Justin Pryzby a écrit :

On Wed, Apr 06, 2022 at 07:43:42PM +0200, Gilles Darold wrote:

Thanks for the review, all these changes are available in new version v6
of the patch and attached here.

This is failing in CI (except on macos, which is strangely passing).
http://cfbot.cputube.org/gilles-darold.html

https://api.cirrus-ci.com/v1/artifact/task/5379693443547136/log/src/bin/scripts/tmp_check/log/regress_log_100_vacuumdb

not ok 59 - vacuumdb --schema "Foo" postgres exit code 0

#   Failed test 'vacuumdb --schema "Foo" postgres exit code 0'
#   at t/100_vacuumdb.pl line 151.
not ok 60 - vacuumdb --schema schema only: SQL found in server log

#   Failed test 'vacuumdb --schema schema only: SQL found in server log'
#   at t/100_vacuumdb.pl line 151.
#   '2022-04-06 18:15:36.313 UTC [34857][not initialized] 
[[unknown]][:0] LOG:  connection received: host=[local]
# 2022-04-06 18:15:36.314 UTC [34857][client backend] [[unknown]][3/2801:0] 
LOG:  connection authorized: user=postgres database=postgres 
application_name=100_vacuumdb.pl
# 2022-04-06 18:15:36.318 UTC [34857][client backend] 
[100_vacuumdb.pl][3/2802:0] LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
# 2022-04-06 18:15:36.586 UTC [34857][client backend] [100_vacuumdb.pl][:0] 
LOG:  disconnection: session time: 0:00:00.273 user=postgres database=postgres 
host=[local]
# '
# doesn't match '(?^:VACUUM "Foo".bar)'



Ok, got it with the help of rjuju. Actually it was compiling well using
gcc, but clang gives some warnings. Fixing those warnings makes CI happy.



Attached v7 of the patch that should pass cfbot.

--
Gilles Darold
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 956c0f01cb..0de001ef24 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -39,6 +39,40 @@ PostgreSQL documentation
dbname
   
 
+  
+   vacuumdb
+   connection-option
+   option
+
+   
+
+ 
+   
+
+ 
+  -n
+  --schema
+ 
+ schema
+
+   
+
+   
+
+ 
+  -N
+  --exclude-schema
+ 
+ schema
+
+   
+ 
+
+   
+
+   dbname
+  
+
   
vacuumdb
connection-option
@@ -244,6 +278,28 @@ PostgreSQL documentation
   
  
 
+ 
+  -n schema
+  --schema=schema
+  
+   
+Clean or analyze all tables in schema only.
+Multiple schemas can be vacuumed by writing multiple -n switches.
+   
+  
+ 
+
+ 
+  -N schema
+  --exclude-schema=schema
+  
+   
+Clean or analyze all tables NOT in schema.
+Multiple schemas can be excluded from the vacuum by writing multiple -N switches.
+   
+  
+ 
+
  
   --no-index-cleanup
   
@@ -619,6 +675,14 @@ PostgreSQL documentation
 $ vacuumdb --analyze --verbose --table='foo(bar)' xyzzy
 
 
+   
+To clean all tables in the foo and bar schemas
+only in a database named xyzzy:
+
+$ vacuumdb --schema='foo' --schema='bar' xyzzy
+
+
+
  
 
  
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index 96a818a3c1..7bbfb97246 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -103,6 +103,8 @@ $node->safe_psql(
   CREATE TABLE funcidx (x int);
   INSERT INTO funcidx VALUES (0),(1),(2),(3);
   CREATE INDEX i0 ON funcidx ((f1(x)));
+  CREATE SCHEMA "Foo";
+  CREATE TABLE "Foo".bar(id int);
 |);
 $node->command_ok([qw|vacuumdb -Z --table="need""q(uot"(")x") postgres|],
 	'column list');
@@ -146,5 +148,15 @@ $node->issues_sql_like(
 	[ 'vacuumdb', '--min-xid-age', '2147483001', 'postgres' ],
 	qr/GREATEST.*relfrozenxid.*2147483001/,
 	'vacuumdb --table --min-xid-age');
+$node->issues_sql_like(
+	[ 'vacuumdb', '--schema', '"Foo"', 'postgres' ],
+	qr/VACUUM "Foo".bar/,
+	'vacuumdb --schema schema only');
+$node->command_fails(
+	[ 'vacuumdb',   '-n', 'pg_catalog', '-t', 'pg_class', 'postgres' ],
+	'cannot vacuum all tables in schema(s) and specific table(s) at the same time');
+$node->command_fails(
+	[ 'vacuumdb',   '-n', 'pg_catalog', '-N',  '"Foo"', 'postgres' ],
+	'cannot use option -n | --schema and -N | --exclude-schema at the same time');
 
 done_testing();
diff --git a/src/bin/scripts/t/101_vacuumdb_all.pl b/src/bin/scripts/t/101_vacuumdb_all.pl
index 1dcf411767..b122c995b1 100644
--- a/src/bin/scripts/t/101_vacuumdb_all.pl
+++ b/src/bin/scripts/t/101_vacuumdb_all.pl
@@ -15,5 +15,8 @@ $node->issues_sql_like(
 	[ 'vacuumdb', '-a' ],
 	qr/statement: VACUUM.*statement: VACUUM/s,
 	'vacuum all databases');
+$node->command_fails(
+	[ 'vacuumdb', '-a',  '-n', 'pg_catalog' ],
+	'cannot vacuum specific schema(s) in all databases');
 
 done_testing();
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 4f6917fd39..f118b05169 100644
--- a/src/bin/scripts/vacuumdb.c
+++ 

Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 10:45 AM Thom Brown  wrote:
> Thanks. This doesn't include my self-correction:
>
> s/kept on standby/kept on the standby/

Here is v2, endeavoring to rectify that oversight.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-docs-Note-the-recovery_min_apply_delay-bloats-pg_.patch
Description: Binary data


Re: Commitfest Closed

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 5:58 AM Alvaro Herrera  wrote:
> Thanks for herding through the CF!

+1

-- 
Peter Geoghegan




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Julien Rouhaud
On Fri, Apr 08, 2022 at 03:04:18PM +0200, Magnus Hagander wrote:
> On Fri, Apr 8, 2022 at 2:42 PM Robert Haas  wrote:
> 
> > On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier 
> > wrote:
> > > On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > > > For these two patches, I'd say a day or two after feature freeze is a
> > > > reasonable goal.
> > >
> > > Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> > > frontend error refactoring, I am also fine to have two exceptions with
> > > the freeze deadline.
> >
> > Done now.
> >
> 
> \o/

Woohoo!  Thanks a lot!




Re: Mingw task for Cirrus CI

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-07, Andres Freund wrote:

> Since dash won't help us to get the build time down sufficiently, and the
> tests don't pass without a separate build tree, I looked at what makes
> config/prep_buildtree so slow.

Maybe we can replace prep_buildtree with a Perl script.  Surely that
should be faster.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 8:21 AM Stephen Frost  wrote:
> Added an explicit 'environment' option to allow for, basically, existing
> behavior, where we don't mess with the environment variable at all,
> though I kept the default as MEMORY since I don't think it's really
> typical that folks actually want regular user backends to inherit the
> credential cache of the server.
>
> Added a few more tests and updated the documentation too.  Sadly, seems
> we've missed the deadline for v15 though for lack of feedback on these.
> Would really like to get some other folks commenting as these are new
> pg_hba and postgresql.conf options being added.

Hi,

I don't think this patch is quite baked enough to go in even if the
deadline hadn't formally passed, but I'm happy to offer a few opinions
... especially if we can also try to sort out a plan for getting that
wider-checksums thing you mentioned done for v16.

 + /* gssencmode is also libpq option, same to above. */
+ {"gssencmode", UserMappingRelationId, true},

I really hate names like this that are just a bunch of stuff strung
together with no punctuation and some arbitrary abbreviations thrown
in for good measure. But since the libpq parameter already exists it's
hard to argue we should do anything else here.

+  allow_cred_delegation

First, I again recommend not choosing words at random to abbreviate.
"delegate_credentials" would be shorter and clearer. Second, I think
we need to decide whether we envision just having one parameter here
for every kind of credential delegation that libpq might ever support,
or whether this is really something specific to GSS. If the latter,
the name should mention GSS.

I also suggest that the default value of this option should be false,
rather than true. I would be unhappy if ssh started defaulting to
ForwardAgent=yes, because that's less secure and I don't want my
credentials shared with random servers without me making a choice to
do that. Similarly here I think we should default to the more secure
option.

+  
+   
+Sets the location of the Kerberos credential cache to be used for
+regular user backends which go through authentication.  The default is
+MEMORY:, which is where delegated credentials
+are stored (and is otherwise empty).  Care should be used when changing
+this value- setting it to a file-based credential cache will mean that
+user backends could potentially use any credentials stored to access
+other systems.
+If this parameter is set to an empty string, then the variable will be
+ explicit un-set and the system-dependent default is used, which may be a
+ file-based credential cache with the same caveats as previously
+ mentioned.  If the special value 'environment' is used, then the variable
+ is left untouched and will be whatever was set in the environment at
+ startup time.

"MEMORY:" seems like a pretty weird choice of arbitrary string. Is it
supposed to look like a Windows drive letter or pseudo-device, or
what? I'm not sure exactly what's better here, but I just think this
doesn't look like anything else we've got today. And then we've got a
second special environment, "environment", which looks completely
different: now it's lower-case and without the colon. And then empty
string is special too.

I wonder whether we really need quite this many cases. But if we do, they
probably need better and more consistent naming.

The formatting here also looks weird.

+#ifndef PG_KRB_USER_CCACHE
+#define PG_KRB_USER_CCACHE "MEMORY:"
+#endif

At the risk of stating the obvious, the general idea of a #define is
that you define things in one place and then use the defined symbol
rather than the original value everywhere. This patch takes the
less-useful approach of defining two different symbols for the same
string in different files. This one has this #ifndef/#endif guard here
which I think it probably shouldn't, since the choice of string
probably shouldn't be compile-time configurable, but it also won't
work, because there's no similar guard in the other file.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
Hi Andrew,

> You should set MSYSTEM=UCRT64 in the environment section. Given that,
> there should be no need to specify a --host= setting for configure.
>

It's set to UCRT64 on the Docker image side [1]. I didn't know --host isn't
necessary in the UCRT64 environment. I'll remove it then.

 [1]
 
https://github.com/anarazel/pg-vm-images/blob/main/docker/windows_ci_mingw64#L11


Best,
Melih


Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

2022-04-08 Thread Matthias van de Meent
On Sat, 19 Mar 2022 at 01:15, Andres Freund  wrote:
> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>
> pg_rewrite is the second biggest relation in an empty database already...

Yeah, that's not great. Thanks for nerd-sniping me into looking into
how views and pg_rewrite rules work, that was very interesting and I
learned quite a lot.

# Immediately potential, limited to progress views

I noticed that the CASE-WHEN (used in translating progress stage index
to stage names) in those progress reporting views can be more
efficiently described (although with slightly worse behaviour around
undefined values) using text array lookups (as attached). That
resulted in somewhat smaller rewrite entries for the progress views
(toast compression was good old pglz):

template1=# SELECT sum(octet_length(ev_action)),
SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
ev_class::regclass::text LIKE '%progress%';

master:
  sum  |  sum
-------+-------
 97277 | 19956
patched:
  sum  |  sum
-------+-------
 77069 | 18417

So this seems like a nice improvement of 20% uncompressed / 7% compressed.

I tested various cases of phase number to text translations: `CASE ..
WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
results below:

postgres=# create or replace view arrayliteral_view as select
(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
s(index);
CREATE VIEW
postgres=# create or replace view stringcast_view as select
('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view split_stringcast_view as select
(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view case_view as select case index when
0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
'e' when 5 then 'f' end as name from tst s(index);
CREATE VIEW


postgres=# postgres=# select ev_class::regclass::text,
octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
where ev_class in ('arrayliteral_view'::regclass::oid,
'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
'stringcast_view'::regclass::oid);
        ev_class        | octet_length | pg_column_size
------------------------+--------------+----------------
 arrayliteral_view      |         3311 |           1322
 stringcast_view        |         2610 |           1257
 case_view              |         5170 |           1412
 split_stringcast_view  |         2847 |           1350

It seems to me that we could consider replacing the CASE statements
with array literals and lookups if we really value our template
database size. But, as text literal concatenations don't seem to get
constant folded before storing them in the rules table, this rewrite
of the views would result in long lines in the system_views.sql file,
or we'd have to deal with the additional overhead of the append
operator and cast nodes.

# Future work; nodeToString / readNode, all rewrite rules

Additionally, we might want to consider other changes like default (or
empty value) elision in nodeToString, if that is considered a
reasonable option and if we really want to reduce the size of the
pg_rewrite table.

I think a lot of space can be recovered from that: A manual removal of
what seemed to be fields with default values (and the removal of all
query location related fields) in the current definition of
pg_stat_progress_create_index reduces its uncompressed size from
23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
for an on-disk space saving of 33% for this view's ev_action.

Do note, however, that that would add significant branching in the
nodeToString and readNode code, which might slow down that code
significantly. I'm not planning on working on that; but in my opinion
that is a viable path to reducing the size of new database catalogs.


-Matthias

PS. attached patch is not to be considered complete - it is a minimal
example of the array literal form. It fails regression tests because I
didn't bother updating or including the regression tests on system
views.
Index: src/backend/catalog/system_views.sql
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
--- a/src/backend/catalog/system_views.sql  (revision 
32723e5fabcc7db1bf4e897baaf0d251b500c1dc)
+++ b/src/backend/catalog/system_views.sql  (date 1649160138886)
@@ -1120,13 +1120,7 @@
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
 CAST(S.relid AS oid) AS relid,
-CASE S.param1 WHEN 0 THEN 'initializing'
-  WHEN 1 THEN 'acquiring sample rows'
-  WHEN 2 THEN 'acquiring inherited sample rows'
-  WHEN 3 THEN 'computing statistics'
-  WHEN 4 THEN 

Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Thom Brown
On Fri, 8 Apr 2022, 14:36 Robert Haas,  wrote:

> On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> > On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > > I share your discomfort with the wording.  How about:
> > >
> > > WAL records must be kept on standby until they are ready to be applied.
> > > Therefore, longer delays will result in a greater accumulation of WAL
> files,
> > > increasing disk space requirements for the standby's
> pg_wal
> > > directory.
> >
> > Looks awesome.
>
> Here that is in patch form. I feel that the feature freeze should not
> preclude committing this documentation improvement, but if someone
> feels otherwise, then I will leave this until the tree reopens.
>

Thanks. This doesn't include my self-correction:

s/kept on standby/kept on the standby/

Thom

>


Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart  wrote:
> Presently, WAL recycling uses durable_rename_excl(), which notes that a
> crash at an unfortunate moment can result in two links to the same file.
> My testing [1] demonstrated that it was possible to end up with two links
> to the same file in pg_wal after a crash just before unlink() during WAL
> recycling.  Specifically, the test produced links to the same file for the
> current WAL file and the next one because the half-recycled WAL file was
> re-recycled upon restarting.  This seems likely to lead to WAL corruption.

Wow, that's bad.

> The attached patch prevents this problem by using durable_rename() instead
> of durable_rename_excl() for WAL recycling.  This removes the protection
> against accidentally overwriting an existing WAL file, but there shouldn't
> be one.

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, calling
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.
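
For illustration, a minimal sketch of that stat-then-rename idea (not the
actual backend code; plain rename() stands in for durable_rename(), which
additionally fsyncs the files and their containing directory):

/*
 * Sketch only: skip the rename when the target already exists, otherwise
 * rename.  In the backend, durable_rename() would be used instead of
 * rename(), and errors would be reported through ereport().
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int
rename_unless_target_exists(const char *oldfile, const char *newfile)
{
	struct stat st;

	if (stat(newfile, &st) == 0)
	{
		/* Target exists: "try not to overwrite", so do nothing. */
		fprintf(stderr, "not renaming \"%s\": \"%s\" already exists\n",
				oldfile, newfile);
		return -1;
	}
	if (errno != ENOENT)
	{
		fprintf(stderr, "could not stat \"%s\": %s\n",
				newfile, strerror(errno));
		return -1;
	}

	/* Target is missing; here the backend would call durable_rename(). */
	if (rename(oldfile, newfile) != 0)
	{
		fprintf(stderr, "could not rename \"%s\" to \"%s\": %s\n",
				oldfile, newfile, strerror(errno));
		return -1;
	}
	return 0;
}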

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-08 Thread Frédéric Yhuel




On 4/8/22 02:22, Michael Paquier wrote:

On Thu, Apr 07, 2022 at 05:29:36PM +0200, Guillaume Lelarge wrote:

On Thu, Apr 7, 2022 at 15:44, Frédéric Yhuel wrote:

On 4/7/22 14:40, Justin Pryzby wrote:
Thank you Justin! I applied your fixes in the v2 patch (attached).


v2 patch sounds good.


The location of the new sentence and its wording seem fine to me.  So
no objections from me to add what's suggested, as suggested.  I'll
wait for a couple of days first.



Thank you Michael.


Indeed ;) That being said, REINDEX CONCURRENTLY could give you an
invalid index, so sometimes you may be tempted to go for a simpler
REINDEX, especially if you believe that the SELECTs won't be blocked.


Agreed.


There are many factors to take into account, one is more expensive
than the other in terms of resources and has downsides, downsides
compensated by the reduction in the lock requirements.  There are
cases where REINDEX is a must-have, as CONCURRENTLY does not support
catalog indexes, and these tend to be easily noticed when corruption
spreads around.


Indeed!

Best regards,
Frédéric




Re: remove more archiving overhead

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart  wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> > Yes.  I found that a crash at an unfortunate moment can produce multiple
> > links to the same file in pg_wal, which seemed bad independent of archival.
> > By fixing that (i.e., switching from durable_rename_excl() to
> > durable_rename()), we not only avoid this problem, but we also avoid trying
> > to archive a file the server is concurrently writing.  Then, after a crash,
> > the WAL file to archive should either not exist (which is handled by the
> > archiver) or contain the same contents as any preexisting archives.
>
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched.  I've attached a new patch that only contains the part
> related to reducing archiving overhead.

While we're now after the feature freeze and thus this will need to
wait for v16, it looks like a reasonable change to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




GSOC: New and improved website for pgjdbc (JDBC) (2022)

2022-04-08 Thread S.R Keshav



postgreSql_ New and improved website for pgjdbc (JDBC) (2022).pdf
Description: Adobe PDF document


Re: Atomic rename feature for Windows.

2022-04-08 Thread Greg Stark
On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
>
> I'm not for dropping support for some platform just because it's old.

I guess I'll have to spin up the Vax again :)




Re: why pg_walfile_name() cannot be executed during recovery?

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 9:31 AM Bharath Rupireddy
 wrote:
> Fundamental question - should the pg_walfile_{name, name_offset} check
> whether the file with the computed WAL file name exists on the server
> right now or ever existed earlier? Right now, they don't do that, see
> [1].

I don't think that checking whether the file exists is the right
approach. However, I do think that it's important to be precise about
which TLI is going to be used. I think it would be reasonable to
redefine this function (on both the primary and the standby) so that
the TLI that is used is the one that was in effect at the time record
at the given LSN was either written or replayed. Then, you could
potentially use this function to figure out whether you still have the
WAL files that are needed to replay up to some previous point in the
WAL stream. However, what about the segments where we switched from
one TLI to the next in the middle of the segment? There, you probably
need both the old and the new segments, or maybe if you're trying to
stream them you only need the new one because we have some weird
special case that will send the segment from the new timeline when the
segment from the old timeline is requested. So you couldn't just call
this function on one LSN per segment and call it good, and it wouldn't
necessarily be the case that the filenames you got back were exactly
the ones you needed.
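
To make the timeline dependence concrete, here is a small self-contained
sketch (assuming the default 16 MB segment size) of how a segment file name
is derived from a TLI and an LSN; for the LSN F/0 above it produces different
names under TLI 1 and TLI 2, and the TLI 2 name is one that never existed on
disk in that scenario:

/*
 * Sketch only: derive a WAL segment file name from a timeline ID and an
 * LSN, assuming 16 MB (2^24-byte) segments.  The name is three 8-digit
 * hex fields: TLI, then the segment number split into two parts.
 */
#include <inttypes.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE	(UINT64_C(16) * 1024 * 1024)
#define SEGMENTS_PER_XLOGID	(UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)

static void
wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t lsn)
{
	uint64_t	segno = lsn / WAL_SEGMENT_SIZE;

	snprintf(buf, len, "%08X%08X%08X",
			 (unsigned) tli,
			 (unsigned) (segno / SEGMENTS_PER_XLOGID),
			 (unsigned) (segno % SEGMENTS_PER_XLOGID));
}

int
main(void)
{
	char		name[64];
	uint64_t	lsn = UINT64_C(0xF) << 32;	/* LSN F/0 */

	wal_file_name(name, sizeof(name), 1, lsn);
	printf("TLI 1: %s\n", name);	/* 000000010000000F00000000 */
	wal_file_name(name, sizeof(name), 2, lsn);
	printf("TLI 2: %s\n", name);	/* same LSN, different file name */
	return 0;
}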

So I'm not entirely sure this proposal is good enough, but it at least
would have the advantage of meaning that the filename you get back is
one that existed at some point in time and somebody used it for
something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Can we automatically add elapsed times to tap test log?

2022-04-08 Thread Andrew Dunstan

On 4/7/22 19:55, Andrew Dunstan wrote:
> On 4/7/22 17:58, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>>> Andres Freund  writes:
 On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> I too think that the elapsed time is useful.  I'm less convinced
> that the time-of-day marker is useful.
 I think it'd be quite useful if it had more precision - it's a pita to
 correlate regress_log_* output with server logs.
>>> Fair point.  Maybe we could keep the timestamp (with ms precision
>>> if possible) and then the parenthetical bit is time-since-last-line
>>> (also with ms precision)?  I think that would more or less satisfy
>>> both uses.
>> Would work for me...
>>
> All doable. Time::HiRes gives us a higher resolution timer. I'll post a
> new version in a day or two.


New version attached.


Sample traces:


andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' 
regress_log_020_pg_receivewal | tail -n 15
[09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support
[09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support
[09:22:45.296](0.265s) ok 32 - streaming some WAL
[09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete
[09:22:45.298](0.001s) ok 34 - check stream dir permissions
[09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming 
point
[09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit 
code not 0
[09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: 
matches
[09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn
[09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived
[09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from 
the slot's position
[09:22:46.598](0.001s) ok 40 - WAL segment 0001000B archived 
after timeline jump
[09:22:46.598](0.000s) ok 41 - WAL segment 0002000C archived 
after timeline jump
[09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline 
jump
[09:22:46.599](0.001s) 1..42


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Test/SimpleTee.pm b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
index bb9d79a755..d92c31a891 100644
--- a/src/test/perl/PostgreSQL/Test/SimpleTee.pm
+++ b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
@@ -10,10 +10,31 @@
 # method is currently implemented; that's all we need. We don't want to
 # depend on IO::Tee just for this.
 
+# The package is enhanced to add timestamp and elapsed time decorations to
+# the log file traces sent through this interface from Test::More.
+
 package PostgreSQL::Test::SimpleTee;
 use strict;
 use warnings;
 
+use Time::HiRes qw(time);
+
+my $last_time;
+
+BEGIN { $last_time = time; }
+
+sub _time_str
+{
+	my $tm = time;
+	my $diff = $tm - $last_time;
+	$last_time = $tm;
+my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
+  localtime($tm);
+	my $usec = int(1000 * ($tm - int($tm)));
+return sprintf("[%.2d:%.2d:%.2d.%.3d](%.3fs) ",
+   $hour, $min, $sec, $usec, $diff);
+}
+
 sub TIEHANDLE
 {
 	my $self = shift;
@@ -24,10 +45,16 @@ sub PRINT
 {
 	my $self = shift;
 	my $ok   = 1;
+	# The first file argument passed to tiehandle in PostgreSQL::Test::Utils is
+	# the original stdout, which is what PROVE sees. Additional decorations
+	# confuse it, so only put out the time string on files after the first.
+	my $skip = 1;
+	my $ts = _time_str;
 	for my $fh (@$self)
 	{
-		print $fh @_ or $ok = 0;
+		print $fh ($skip ? "" : $ts), @_ or $ok = 0;
 		$fh->flush   or $ok = 0;
+		$skip = 0;
 	}
 	return $ok;
 }


Re: GSoC: New and improved website for pgjdbc (JDBC)

2022-04-08 Thread Dave Cramer
Joseph



On Thu, 7 Apr 2022 at 17:49, Joseph Ho  wrote:

> Hello,
>
> I am Joseph Ho, a senior at Dr Norman Bethune Collegiate Institute
> interested in going into computer science. I am interested in working to
> create and improve the website for pgjdbc during GSoC 2022.
>
> I am wondering how the draft proposal should be made. Will I need to
> submit a web design of the new and improved website or will I need to
> submit something else? Also, am I able to use a web framework of my choice
> or is there one that you prefer that we use?
>

You should register on the GSoC site Contributor Registration | Google
Summer of Code 

The draft proposal can just be a document at this point which outlines your
ideas

Currently the site is built using Jekyll, and I see no good reason to
change. What I am looking for is to update Jekyll to the latest version
and make the site cleaner with a new design.

You should be aware that proposals have to be in by the 19th of April


Regards,

Dave

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 3:36 PM Robert Haas  wrote:

> On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> > On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > > I share your discomfort with the wording.  How about:
> > >
> > > WAL records must be kept on standby until they are ready to be applied.
> > > Therefore, longer delays will result in a greater accumulation of WAL
> files,
> > > increasing disk space requirements for the standby's
> pg_wal
> > > directory.
> >
> > Looks awesome.
>
> Here that is in patch form. I feel that the feature freeze should not
> preclude committing this documentation improvement, but if someone
> feels otherwise, then I will leave this until the tree reopens.
>

We normally allow documentation and bug fixes after the feature freeze.
(It's only in the "we're about to wrap the release right now"-freeze that
we have to avoid those)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-08 Thread Bharath Rupireddy
On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
 wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.
> >
> > Thoughts?
>
> I think this is a common problem, people run into. Once way to
> simulate network delay is what you suggest, yes. But I was wondering
> if there are tools/libraries that can help us to do that. Googling
> gives OS specific tools but nothing like a C or perl library which can
> be used for this purpose.

Thanks. IMO, non-postgres tools (if such tools exist at all) to simulate
network delays may not be reliable or easy to use, say, for adding some
TAP tests for HA features. Especially in cloud environments, using those
external tools may not even be possible. With the developer-only GUCs
proposed here in this thread, it's pretty easy to simulate what we want;
the only extra caution is to not let others (probably non-superusers) set
and misuse these developer-only GUCs. I think that's equally true for all
the existing developer-only GUCs.

Thoughts?

Regards,
Bharath Rupireddy.




Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Fri, Apr 8, 2022 at 2:19 PM David Rowley  wrote:
> > On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
> > > On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> > wrote:
> > >>
> > >> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
> > wrote:
> > >>>
> > >>> If we go with this patch,  the problem I see here is that the amount
> > >>> of work the JIT compiler must do for a given query depends mostly on
> > >>> the number of expressions that must be compiled in the query (also to
> > >>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
> > >>> jit_tuple_deforming and jit_expressions). The DBA does not really have
> > >>> much control over the number of expressions in the query.  All he or
> > >>> she can do to get rid of the warning is something like increase
> > >>> jit_above_cost.  After a few iterations of that, the end result is
> > >>> that jit_above_cost is now high enough that JIT no longer triggers
> > >>> for, say, that query to that table with 1000 partitions where no
> > >>> plan-time pruning takes place.  Is that really a good thing? It likely
> > >>> means that we just rarely JIT anything at all!
> > >>
> > >>
> > >> I don't agree with the conclusion of that.
> > >>
> > >> What the parameter would be useful for is to be able to tune those
> > costs (or just turn it off) *for that individual query*. That doesn't mean
> > you "rarely JIT anything atll", it just means you rarely JIT that
> > particular query.
> >
> > I just struggle to imagine that anyone is going to spend much effort
> > tuning a warning parameter per query.  I imagine they're far more
> > likely to just ramp it up to only catch some high percentile problems
> > or just (more likely) just not bother with it.  It seems more likely
> > that if anyone was to tune anything per query here it would be
> > jit_above_cost, since that actually might have an affect on the
> > performance of the query, rather than if it spits out some warning
> > message or not.  ISTM that if the user knows what to set it to per
> > query, then there's very little point in having a warning as we'd be
> > alerting them to something they already know about.
> 
> I would not expect people to tune the *warning* at a query level. If
> anything, then yes, they would tune either jit_above_cost or just
> jit=off. But the idea being you can do that on a per query level instead of
> globally.

Yeah, exactly, this is about having a busy system and wanting to know
which queries are spending a lot of time doing JIT relative to the query
time, so that you can go adjust your JIT parameters or possibly disable
JIT for those queries (or maybe bring those cases to -hackers and try to
help make our costing better).

> > I looked in the -general list to see if we could get some common
> > explanations to give us an idea of the most common reason for high JIT
> > compilation time. It seems that the plans were never simple. [1] seems
> > due to a complex plan. I'm basing that off the "Functions: 167". I
> > didn't catch the full plan. From what I can tell, [2] seems to be due
> > to "lots of empty tables", so assuming the clamping at 1 page is
> > causing issues there.  I think both of those cases could be resolved
> > by building the costing the way I mentioned.  I admit that 2 cases is
> > not a very large sample size.
> 
> Again, I am very much for improvements of the costing model. This is in no
> way intended to be a replacement for that. It's intended to be a stop-gap.

Not sure I'd say it's a 'stop-gap' as it's really very similar, imv
anyway, to log_min_duration_statement- you want to know what queries are
taking a lot of time but you can't log all of them.

> What I see much of today are things like
> https://dba.stackexchange.com/questions/264955/handling-performance-problems-with-jit-in-postgres-12
> or
> https://dev.to/xenatisch/cascade-of-doom-jit-and-how-a-postgres-update-led-to-70-failure-on-a-critical-national-service-3f2a
> 
> The bottom line is that people end up with recommendations to turn off JIT
> globally more or less by default. Because there's no real useful way today
> to figure out when it causes problems vs when it helps.

Yeah, that's frustrating.

> The addition to pg_stat_statements I pushed a short while ago would help
> with that. But I think having a warning like this would also be useful. As
> a stop-gap measure, yes, but we really don't know when we will have an
> improved costing model for it. I hope you're right and that we can have it
> by 16, and then I will definitely advocate for removing the warning again
> if it works.

Having this in pg_stat_statements is certainly helpful but having a
warning also is.  I don't think we have to address this in only one way.
A lot faster to flip this guc and then look in the logs on a busy system
than to install pg_stat_statements, restart the cluster once you get
permission to do so, and then query it.

Thanks,

Stephen

Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Robert Haas
On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > I share your discomfort with the wording.  How about:
> >
> > WAL records must be kept on standby until they are ready to be applied.
> > Therefore, longer delays will result in a greater accumulation of WAL files,
> > increasing disk space requirements for the standby's pg_wal
> > directory.
>
> Looks awesome.

Here that is in patch form. I feel that the feature freeze should not
preclude committing this documentation improvement, but if someone
feels otherwise, then I will leave this until the tree reopens.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-docs-Note-the-recovery_min_apply_delay-bloats-pg_.patch
Description: Binary data


Re: Support for grabbing multiple consecutive values with nextval()

2022-04-08 Thread Greg Stark
On Sun, 27 Feb 2022 at 07:09, Jille Timmermans  wrote:
>
> Hi,
>
> First time PostgreSQL contributor here :)

I wish I had noticed this patch during the CF. It seems like a nice
self-contained feature that could have been easily reviewed and
committed and it's always good to see first-time contributions.
Hopefully it'll get committed early in the next cycle.


-- 
greg




Re: why pg_walfile_name() cannot be executed during recovery?

2022-04-08 Thread Bharath Rupireddy
On Thu, Apr 7, 2022 at 9:07 PM Robert Haas  wrote:
>
> On Thu, Apr 7, 2022 at 9:32 AM Bharath Rupireddy
>  wrote:
> > I spent some time today to allow pg_walfile_{name, name_offset} run in
> > recovery. Timeline ID is computed while in recovery as follows - WAL
> > receiver's last received and flushed WAL record's TLI if it's
> > streaming, otherwise the last replayed WAL record's TLI. This way,
> > these functions can be used on standby or PITR server or even in crash
> > recovery if the server opens up for read-only connections.
>
> I don't think this is a good definition. Suppose I ask for
> pg_walfile_name() using an older LSN. With this approach, we're going
> to get a filename based on the idea that the TLI that was in effect
> back then is the same one as the TLI that is in effect now, which
> might not be true. For example, suppose that the current TLI is 2 and
> it branched off of timeline 1 at 10/0. If I ask for
> pg_walfile_name('F/0'), it's going to give me the name of a WAL file
> that has never existed. That seems bad.
>
> It's also worth noting that there's a bit of a definitional problem
> here. If in the same situation, I ask for pg_walfile_name('11/0'),
> it's going to give me a filename based on TLI 2, but there's also a
> WAL file for that LSN with TLI 1. How do we know which one the user
> wants? Perhaps one idea would be to say that the relevant TLI is the
> one which was in effect at the time that LSN was replayed. If we do
> that, what about future LSNs? We could assume that for future LSNs,
> the TLI should be the same as the current TLI, but maybe that's also
> misleading, because recovery_target_timeline could be set.

Fundamental question - should the pg_walfile_{name, name_offset} check
whether the file with the computed WAL file name exists on the server
right now or ever existed earlier? Right now, they don't do that, see
[1].

I think we can make the functions more robust:
pg_walfile_{name, name_offset}(lsn, check_if_file_exists = false, tli
= invalid_timelineid) - when check_if_file_exists is true, check whether
the computed WAL file exists; when a valid tli is provided, use it when
computing the WAL file name. When tli isn't provided, it continues to use
the insert TLI on the primary, and in recovery it uses the TLI as proposed
in my patch. Perhaps it can also do this (as Michael suggested): if
check_if_file_exists is true, tli isn't provided, and there is timeline
history, then it can look through all the timelines and check whether the
file exists with the name computed from each historical TLI.

> I think it's really important to start by being precise about the
> question that we think pg_walfile_name() ought to be answering. If we
> don't know that, then we really can't say what TLI it should be using.
> It's not hard to make the function return SOME answer using SOME TLI,
> but then it's not clear that the answer is the right one for any
> particular purpose. And in that case the function is more dangerous
> than useful, because people will write code that uses it to do stuff,
> and then that stuff won't actually work correctly under all
> circumstances.

Yes, once we agree on the semantics of these functions, having better
documentation will help.

Thoughts?

[1]
postgres=# select * from pg_walfile_name('5/dfdf');
     pg_walfile_name
--------------------------
 000000010000000500000000
(1 row)
postgres=# select * from pg_walfile_name_offset('5/dfdf');
        file_name         | file_offset
--------------------------+-------------
 000000010000000500000000 |       57311
(1 row)

Regards,
Bharath Rupireddy.




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 7:01 PM Andres Freund  wrote:
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
> several KB of stack at the current the current value already (*), but if it 
> grows
> further...

I agree that the value of 291 is pretty much accidental, but it also
seems fairly generous to me. The bigger you make it, the more space
you can waste. I must have missed (or failed to understand) previous
discussions about why raising it would be a good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support logical replication of DDLs

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 7:34 AM Alvaro Herrera  wrote:
> > For runtime conditions, one of the things you have mentioned in that
> > thread is to add schema name in the statement at the required places
> > which this patch deals with in a different way by explicitly sending
> > it along with the DDL statement.
>
> Hmm, ok.  The point of the JSON-blob route is that the publisher sends a
> command representation that can be parsed/processed/transformed
> arbitrarily by the subscriber using generic rules; it should be trivial
> to use a JSON tool to change schema A to schema B in any arbitrary DDL
> command, and produce another working DDL command without having to know
> how to write that command specifically.  So if I have a rule that
> "schema A there is schema B here", all DDL commands can be replayed with
> no further coding (without having to rely on getting the run-time
> search_path correct.)

Yeah, that was a really nice aspect of that approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Extract epoch from Interval weird behavior

2022-04-08 Thread Tom Lane
Peter Eisentraut  writes:
> We really wanted to avoid doing calculations in numeric as much as 
> possible.  So we should figure out a different way to write this.  The 
> attached patch works for me.  It's a bit ugly since it hardcodes some 
> factors.  Maybe we can rephrase it a bit more elegantly.

I think it's fine but needs some commentary.  Maybe about like
"To do this calculation in integer arithmetic even though
DAYS_PER_YEAR is fractional, multiply everything by 4
and then divide by 4 again at the end.  This relies on
DAYS_PER_YEAR being a multiple of 0.25 and on SECS_PER_DAY
being a multiple of 4."

BTW, it might be good to parenthesize as

(... big calculation ...) * (SECS_PER_DAY/4)

to eliminate any question of whether the value could overflow
before the final division by 4.

regards, tom lane
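
For concreteness, a standalone sketch of the multiply-by-4 approach being
discussed, using PostgreSQL's usual constants (DAYS_PER_YEAR = 365.25,
DAYS_PER_MONTH = 30, MONTHS_PER_YEAR = 12, SECS_PER_DAY = 86400); this is an
illustration, not the committed code:

/*
 * Sketch only: compute the seconds contributed by the months/days fields
 * of an interval in pure integer arithmetic.  Multiplying by 4 makes
 * 4 * DAYS_PER_YEAR an integer (1461), and the division by 4 happens via
 * the final (SECS_PER_DAY / 4) factor, parenthesized as suggested above.
 */
#include <stdint.h>
#include <stdio.h>

#define MONTHS_PER_YEAR		12
#define DAYS_PER_MONTH		30
#define SECS_PER_DAY		86400
#define FOUR_DAYS_PER_YEAR	1461	/* 4 * 365.25 */

static int64_t
interval_months_days_to_secs(int32_t months, int32_t days)
{
	return ((int64_t) FOUR_DAYS_PER_YEAR * (months / MONTHS_PER_YEAR) +
			(int64_t) 4 * DAYS_PER_MONTH * (months % MONTHS_PER_YEAR) +
			(int64_t) 4 * days) * (SECS_PER_DAY / 4);
}

int
main(void)
{
	/* 1 year 2 months 3 days: (365.25 + 60 + 3) * 86400 = 37000800 */
	printf("%lld\n", (long long) interval_months_days_to_secs(14, 3));
	return 0;
}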




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 2:42 PM Robert Haas  wrote:

> On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier 
> wrote:
> > On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > > For these two patches, I'd say a day or two after feature freeze is a
> > > reasonable goal.
> >
> > Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> > frontend error refactoring, I am also fine to have two exceptions with
> > the freeze deadline.
>
> Done now.
>

\o/

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Commitfest Closed

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-08, Greg Stark wrote:

> Thanks to all the reviewers and committers who put a lot of work in,
> especially in the last two weeks. I especially want to thank Andres
> who showed me how to use the cfbot to check on patch statuses and did
> a lot of work doing that until I was up to speed.

Thanks for herding through the CF!

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Commitfest Closed

2022-04-08 Thread Greg Stark
It has reached 2022-04-08 Anywhere on Earth[*] so I believe that means
Postgres 15 Feature Freeze is in effect (modulo a couple patches that
were held until the end of the commitfest to make merging easier).

I've marked the commitfest closed and will be moving any patches that
didn't receive feedback over to the next commitfest. I think this is
most of the remaining patches though there may be a few Waiting for
Author patches that can be Returned with Feedback or even Rejected.
I'll do the Ready for Committer patches last to allow for the
stragglers held back.

It's always frustrating seeing patches get ignored but on the plus
side nearly 100 patches are marked Committed and a lot of patches did
get feedback.

Thanks to all the reviewers and committers who put a lot of work in,
especially in the last two weeks. I especially want to thank Andres
who showed me how to use the cfbot to check on patch statuses and did
a lot of work doing that until I was up to speed.

[*] https://www.timeanddate.com/time/zones/aoe

-- 
greg




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Justin Pryzby
On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> I have committed the patch after modifying it as such.  (I think we
> can improve these later, if necessary.)

This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.

| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= 
pg_trigger_depth();

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at 
../../../../src/include/nodes/pg_list.h:151
151 return l ? l->length : 0;
(gdb) bt
#0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at 
../../../../src/include/nodes/pg_list.h:151
#1  0x55b43968af89 in mark_async_capable_plan 
(plan=plan@entry=0x7f4219ed93b0, path=path@entry=0x7f4219e89538) at 
createplan.c:1132
#2  0x55b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0, 
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at 
createplan.c:1329
#3  0x55b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0, 
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at 
createplan.c:421
#4  0x55b43968f974 in create_projection_plan 
(root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0f60, 
flags=flags@entry=1) at createplan.c:2039
#5  0x55b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0, 
best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
#6  0x55b439690221 in create_plan (root=root@entry=0x55b43affb2b0, 
best_path=<optimized out>) at createplan.c:348
#7  0x55b4396a1451 in standard_planner (parse=0x55b43af05e28, 
query_string=<optimized out>, cursorOptions=2048, boundParams=0x0) at 
planner.c:413
#8  0x55b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at planner.c:277
#9  0x55b439790c78 in pg_plan_query 
(querytree=querytree@entry=0x55b43af05e28, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at postgres.c:883
#10 0x55b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at postgres.c:975
#11 0x55b439791239 in exec_simple_query 
(query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();") at 
postgres.c:1169
#12 0x55b439793183 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4542
#13 0x55b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at 
postmaster.c:4489
#14 0x55b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0) at 
postmaster.c:4217
#15 0x55b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x55b4396eb401 in PostmasterMain (argc=7, argv=<optimized out>) at 
postmaster.c:1463
#17 0x55b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202

Actually, the original query failed like this:
#2  0x55b4398e9f90 in ExceptionalCondition 
(conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus == 
SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b 
"FailedAssertion", 
#3  0x55b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at 
setrefs.c:1367




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Robert Haas
On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier  wrote:
> On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > For these two patches, I'd say a day or two after feature freeze is a
> > reasonable goal.
>
> Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> frontend error refactoring, I am also fine to have two exceptions with
> the freeze deadline.

Done now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 2:19 PM David Rowley  wrote:

> On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
> >
> >
> >
> > On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> wrote:
> >>
> >> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
> wrote:
> >>>
> >>> If we go with this patch,  the problem I see here is that the amount
> >>> of work the JIT compiler must do for a given query depends mostly on
> >>> the number of expressions that must be compiled in the query (also to
> >>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
> >>> jit_tuple_deforming and jit_expressions). The DBA does not really have
> >>> much control over the number of expressions in the query.  All he or
> >>> she can do to get rid of the warning is something like increase
> >>> jit_above_cost.  After a few iterations of that, the end result is
> >>> that jit_above_cost is now high enough that JIT no longer triggers
> >>> for, say, that query to that table with 1000 partitions where no
> >>> plan-time pruning takes place.  Is that really a good thing? It likely
> >>> means that we just rarely JIT anything at all!
> >>
> >>
> >> I don't agree with the conclusion of that.
> >>
> >> What the parameter would be useful for is to be able to tune those
> costs (or just turn it off) *for that individual query*. That doesn't mean
> you "rarely JIT anything atll", it just means you rarely JIT that
> particular query.
>
> I just struggle to imagine that anyone is going to spend much effort
> tuning a warning parameter per query.  I imagine they're far more
> likely to just ramp it up to only catch some high percentile problems
> or just (more likely) just not bother with it.  It seems more likely
> that if anyone was to tune anything per query here it would be
> jit_above_cost, since that actually might have an affect on the
> performance of the query, rather than if it spits out some warning
> message or not.  ISTM that if the user knows what to set it to per
> query, then there's very little point in having a warning as we'd be
> alerting them to something they already know about.
>

I would not expect people to tune the *warning* at a query level. If
anything, then yes, they would tune either jit_above_cost or just
jit=off. But the idea being you can do that on a per query level instead of
globally.


I looked in the -general list to see if we could get some common
> explanations to give us an idea of the most common reason for high JIT
> compilation time. It seems that the plans were never simple. [1] seems
> due to a complex plan. I'm basing that off the "Functions: 167". I
> didn't catch the full plan. From what I can tell, [2] seems to be due
> to "lots of empty tables", so assuming the clamping at 1 page is
> causing issues there.  I think both of those cases could be resolved
> by building the costing the way I mentioned.  I admit that 2 cases is
> not a very large sample size.
>

Again, I am very much for improvements of the costing model. This is in no
way intended to be a replacement for that. It's intended to be a stop-gap.

What I see much of today are things like
https://dba.stackexchange.com/questions/264955/handling-performance-problems-with-jit-in-postgres-12
or
https://dev.to/xenatisch/cascade-of-doom-jit-and-how-a-postgres-update-led-to-70-failure-on-a-critical-national-service-3f2a

The bottom line is that people end up with recommendations to turn off JIT
globally more or less by default. Because there's no real useful way today
to figure out when it causes problems vs when it helps.

The addition to pg_stat_statements I pushed a short while ago would help
with that. But I think having a warning like this would also be useful. As
a stop-gap measure, yes, but we really don't know when we will have an
improved costing model for it. I hope you're right and that we can have it
by 16, and then I will definitely advocate for removing the warning again
if it works.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> The new krb_user_ccache is a lot closer to 'global', though it's
> specifically for user-authenticated backends (allowing the postmaster
> and other things like replication connections to use whatever the
> credential cache is set to by the administrator on startup), but that
> seems like it makes sense to me- generally you're not going to want
> regular user backends to be accessing the credential cache of the
> 'postgres' unix account on the server.

Added an explicit 'environment' option to allow for, basically, existing
behavior, where we don't mess with the environment variable at all,
though I kept the default as MEMORY since I don't think it's really
typical that folks actually want regular user backends to inherit the
credential cache of the server.

Added a few more tests and updated the documentation too.  Sadly, it seems
we've missed the deadline for v15 for lack of feedback on these.
Would really like to get some other folks commenting as these are new
pg_hba and postgresql.conf options being added.
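
For anyone trying to picture the intended use, a rough and purely
illustrative sketch (host, database and user names are invented, and the
password-less user mapping assumes the relaxed check described in the commit
message below rather than tested syntax):

    -- The client authenticates over GSS and delegates its credentials;
    -- postgres_fdw can then authenticate onward with those delegated
    -- credentials instead of a stored password.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER other_pg FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'other.example.com', dbname 'appdb');

    CREATE USER MAPPING FOR CURRENT_USER SERVER other_pg
        OPTIONS (user 'alice');   -- no password option

    IMPORT FOREIGN SCHEMA public FROM SERVER other_pg INTO public;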

Thanks!

Stephen
From bd248c3fd82d04d3c12bf6c777f861134a45a101 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 7 Apr 2022 15:34:39 -0400
Subject: [PATCH] Add support for Kerberos credential delegation

Accept GSSAPI/Kerberos delegated credentials.  With this, a user could
authenticate to PostgreSQL using Kerberos credentials, delegate
credentials to the PostgreSQL server, and then the PostgreSQL server
could use those credentials to connect to another service, such as with
postgres_fdw or dblink or theoretically any other authenticated
connection which is able to use delegated credentials.

If an administrator prefers to not allow credentials to be delegated to
the server, they can be disallowed using a new pg_hba option for gss
called 'allow_cred_delegation'.

A new server GUC has also been introduced to allow an administrator to
control what the kerberos credential cache is configured to for user
authenticated backends, krb_user_ccache.  This defaults to MEMORY:,
which is where delegated credentials are stored (and is otherwise empty,
avoiding the risk of an administrator's credentials on the server being
mistakenly picked up and used).

Original patch by: Peifeng Qiu, whacked around some by me.
Reviewed-by: Jacob Champion
Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com
---
 contrib/dblink/dblink.c   |   6 +-
 .../postgres_fdw/expected/postgres_fdw.out|   2 +-
 contrib/postgres_fdw/option.c |   3 +
 doc/src/sgml/client-auth.sgml |  13 ++
 doc/src/sgml/config.sgml  |  28 
 doc/src/sgml/libpq.sgml   |  19 +++
 src/backend/libpq/auth.c  |  27 +++-
 src/backend/libpq/be-gssapi-common.c  |  51 +++
 src/backend/libpq/be-secure-gssapi.c  |  19 ++-
 src/backend/libpq/hba.c   |  19 +++
 src/backend/utils/adt/hbafuncs.c  |   4 +
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc.c  |  15 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/hba.h   |   1 +
 src/include/libpq/libpq-be.h  |   3 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-auth.c|  12 +-
 src/interfaces/libpq/fe-connect.c |  12 ++
 src/interfaces/libpq/fe-secure-gssapi.c   |   3 +-
 src/interfaces/libpq/libpq-fe.h   |   1 +
 src/interfaces/libpq/libpq-int.h  |   1 +
 src/test/kerberos/Makefile|   3 +
 src/test/kerberos/t/001_auth.pl   | 128 --
 src/test/perl/PostgreSQL/Test/Utils.pm|  27 
 27 files changed, 391 insertions(+), 20 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index a06d4bd12d..e5b70e084e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2643,7 +2643,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 {
 	if (!superuser())
 	{
-		if (!PQconnectionUsedPassword(conn))
+		if (!(PQconnectionUsedPassword(conn) || PQconnectionUsedGSSAPI(conn)))
 		{
 			PQfinish(conn);
 			ReleaseExternalFD();
@@ -2652,8 +2652,8 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 
 			ereport(ERROR,
 	(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
-	 errmsg("password is required"),
-	 errdetail("Non-superuser cannot connect if the server does not request a password."),
+	 errmsg("password or GSSAPI is required"),
+	 errdetail("Non-superuser cannot connect if the server does not request a password or use GSSAPI."),
 	 errhint("Target server's authentication method must be 

Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread David Rowley
On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
>
>
>
> On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander  wrote:
>>
>> On Tue, Mar 29, 2022 at 10:06 PM David Rowley  wrote:
>>>
>>> If we go with this patch,  the problem I see here is that the amount
>>> of work the JIT compiler must do for a given query depends mostly on
>>> the number of expressions that must be compiled in the query (also to
>>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
>>> jit_tuple_deforming and jit_expressions). The DBA does not really have
>>> much control over the number of expressions in the query.  All he or
>>> she can do to get rid of the warning is something like increase
>>> jit_above_cost.  After a few iterations of that, the end result is
>>> that jit_above_cost is now high enough that JIT no longer triggers
>>> for, say, that query to that table with 1000 partitions where no
>>> plan-time pruning takes place.  Is that really a good thing? It likely
>>> means that we just rarely JIT anything at all!
>>
>>
>> I don't agree with the conclusion of that.
>>
>> What the parameter would be useful for is to be able to tune those costs (or 
>> just turn it off) *for that individual query*. That doesn't mean you "rarely 
>> JIT anything at all", it just means you rarely JIT that particular query.

I just struggle to imagine that anyone is going to spend much effort
tuning a warning parameter per query.  I imagine they're far more
likely to just ramp it up to only catch some high percentile problems
or just (more likely) just not bother with it.  It seems more likely
that if anyone was to tune anything per query here it would be
jit_above_cost, since that actually might have an effect on the
performance of the query, rather than if it spits out some warning
message or not.  ISTM that if the user knows what to set it to per
query, then there's very little point in having a warning as we'd be
alerting them to something they already know about.

I looked in the -general list to see if we could get some common
explanations to give us an idea of the most common reason for high JIT
compilation time. It seems that the plans were never simple. [1] seems
due to a complex plan. I'm basing that off the "Functions: 167". I
didn't catch the full plan. From what I can tell, [2] seems to be due
to "lots of empty tables", so assuming the clamping at 1 page is
causing issues there.  I think both of those cases could be resolved
by building the costing the way I mentioned.  I admit that 2 cases is
not a very large sample size.

David

[1] 
https://www.postgresql.org/message-id/flat/CAPL5KHq8zfWPzueCemXw4c%2BU568PoDfqo3wBDNm3KAyvybdaMQ%40mail.gmail.com#35aca8b42c3862f44b6be5b260c1a109
[2] 
https://www.postgresql.org/message-id/flat/CAHOFxGo5xJt02RmwAWrtv2K0jcqqxG-cDiR8FQbvb0WxdKhcgw%40mail.gmail.com#12d91822e869a2e22ca830cb5632f549




Re: SQL/JSON: functions

2022-04-08 Thread Andrew Dunstan


On 4/8/22 08:02, Justin Pryzby wrote:
> On Thu, Mar 31, 2022 at 04:25:58PM -0400, Andrew Dunstan wrote:
>> No code chunks left, only a documentation patch which should land
> Documentation review for a6baa4bad.
>
>> Construct a JSON the provided strings:
> a JSON what ?
> *from* the provided strings ?
>
>> Construct a JSON from the provided values various types:
> should say "a JSON scalar" ?
> *of* various types ?
>
>> Construct a JSON object from the provided key/value pairs of various types:
> For comparison, that one looks ok.
>
> +  JSON_EXISTS function checks whether the provided
> +   JSON_VALUE function extracts a value from the 
> provided
> +   JSON_QUERY function extracts an 
> SQL/JSON
> +  JSON_TABLE function queries 
> JSON data
> + JSON_TABLE uses the
> +  JSON_SERIALIZE function transforms a SQL/JSON 
> value
>
> I think all these should all begin with "THE >...< function ...", like the
> others do.
>
> +To use other types, you must create the CAST from 
> json for this type.
> => create a cast from json to this type.
>
> +Values can be null, but not keys.
> I think it's clearer to say "..but keys cannot."
>
> +  For any scalar other than a number or a Boolean the text
>
> Boolean COMMA the text
>
> + The path name must be unique and cannot coincide with column names.
> Maybe say "name must be unique and distinct from the column names."
>
> +  ... If you specify a GROUP BY
> +  or an ORDER BY clause, this function returns a 
> separate JSON object
> +  for each table row.
>
> "for each table row" sounds inaccurate or imprecise.  The SELECT docs say 
> this:
> | GROUP BY will condense into a single row all selected rows that share the 
> same values for the grouped expressions
>
> BTW, the documentation references look a little like OIDs...
> Does someone already have an SNMP-based doc browser ?
> | For details, see Section 9.16.3.4.2.



Many thanks, useful.

I already had a couple of these items on my list but I ran out of time
before tiredness overcame me last night.

I'm planning on removing some of that stuff that generates the last
complaint if I can do it without too much violence.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: SQL/JSON: functions

2022-04-08 Thread Justin Pryzby
On Thu, Mar 31, 2022 at 04:25:58PM -0400, Andrew Dunstan wrote:
> No code chunks left, only a documentation patch which should land

Documentation review for a6baa4bad.

> Construct a JSON the provided strings:

a JSON what ?
*from* the provided strings ?

> Construct a JSON from the provided values various types:

should say "a JSON scalar" ?
*of* various types ?

> Construct a JSON object from the provided key/value pairs of various types:

For comparison, that one looks ok.

+  JSON_EXISTS function checks whether the provided
+   JSON_VALUE function extracts a value from the provided
+   JSON_QUERY function extracts an 
SQL/JSON
+  JSON_TABLE function queries JSON 
data
+ JSON_TABLE uses the
+  JSON_SERIALIZE function transforms a SQL/JSON value

I think all these should all begin with "THE >...< function ...", like the
others do.

+To use other types, you must create the CAST from 
json for this type.
=> create a cast from json to this type.

+Values can be null, but not keys.
I think it's clearer to say "..but keys cannot."

+  For any scalar other than a number or a Boolean the text

Boolean COMMA the text

+ The path name must be unique and cannot coincide with column names.
Maybe say "name must be unique and distinct from the column names."

+  ... If you specify a GROUP BY
+  or an ORDER BY clause, this function returns a 
separate JSON object
+  for each table row.

"for each table row" sounds inaccurate or imprecise.  The SELECT docs say this:
| GROUP BY will condense into a single row all selected rows that share the 
same values for the grouped expressions

BTW, the documentation references look a little like OIDs...
Does someone already have an SNMP-based doc browser ?
| For details, see Section 9.16.3.4.2.




Re: PROXY protocol support

2022-04-08 Thread Magnus Hagander
On Sat, Apr 2, 2022 at 12:17 AM wilfried roset 
wrote:

> Hi,
>
> I've been able to test the patch. Here is a recap of the experimentation.
>
> # Setup
>
> All tests have been done with 3 VMs (PostgreSQL, HAproxy, psql client) on
> Debian 11 communicating over a private network.
> * PostgreSQL have been built with proxy_protocol_11.patch applied on
> master branch (465ab24296).
> * psql client is from postgresql-client-13 from Debian 11 repository.
> * HAproxy version used is 2.5.5-1~bpo11+1 installed from
> https://haproxy.debian.net
>
> # Configuration
>
> PostgreSQL has been configured to listen only on its private IP. To enable
> proxy protocol support `proxy_port` has been configured to `5431` and
> `proxy_servers` to `10.0.0.0/24`. `log_connections` has been turned on to
> make sure the correct IP address is logged. `log_min_duration_statement`
> has been configured to 0 to log all queries. Finally `log_destination` has
> been configured to `csvlog`.
>
> pg_hba.conf is like this:
>
>   local   all all trust
>   hostall all 127.0.0.1/32trust
>   hostall all ::1/128 trust
>   local   replication all trust
>   hostreplication all 127.0.0.1/32trust
>   hostreplication all ::1/128 trust
>   hostall all 10.0.0.208/32   md5
>
> Where 10.0.0.208 is the IP of the psql client's VM.
>
> HAproxy has two frontends, one for proxy protocol (port 5431) and one for
> regular TCP traffic. The configuration looks like this:
>
>   listen postgresql
>   bind 10.0.0.222:5432
>   server pg 10.0.0.253:5432 check
>
>   listen postgresql_proxy
>   bind 10.0.0.222:5431
>   server pg 10.0.0.253:5431 send-proxy-v2
>
> Where 10.0.0.222 is the IP of HAproxy's VM and 10.0.0.253 is the IP of
> PostgreSQL's VM.
>
> # Tests
>
> * from psql's vm to haproxy on port 5432 (no proxy protocol)
>   --> connection denied by pg_hba.conf, as expected
>
> * from psql's vm to postgresql's VM on port 5432 (no proxy protocol)
>   --> connection success with psql's vm ip in logfile and pg_stat_activity
>
> * from psql's vm to postgresql's VM on port 5431 (proxy protocol)
>   --> unable to open a connection, as expected
>
> * from psql's vm to haproxy on port 5431 (proxy protocol)
>   --> connection success with psql's vm ip in logfile and pg_stat_activity
>
> I've also tested without proxy protocol enabled (and pg_hba.conf updated
> accordingly), and PostgreSQL behaves as expected.
>
> # Conclusion
>
> From my point of view the documentation is clear enough and the feature
> works as expected.


Hi!

Thanks for this review and testing!

I think it could do with at least one more look-over at the source code
level at this point, though, since it's been sitting around for a while, so
it won't make it in for this deadline. But hopefully I can get it in early
in the next cycle!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: shared-memory based stats collector

2022-04-08 Thread Ranier Vilela
Hi,

Per Coverity.

pgstat_reset_entry does not check whether the lock was actually acquired.
If shared_stat_reset_contents is called without holding the lock,
isn't that an issue?

regards,

Ranier Vilela


0001-avoid-reset-stats-without-lock.patch
Description: Binary data


Re: Expose JIT counters/timing in pg_stat_statements

2022-04-08 Thread Magnus Hagander
On Tue, Mar 8, 2022 at 4:08 AM Julien Rouhaud  wrote:

> On Mon, Mar 07, 2022 at 01:40:34PM +0100, Magnus Hagander wrote:
> >
> > I wonder if there might be an interesting middle ground, or if that is
> > making it too much. That is, we could have an
> > Option 3:
> > jit_count
> > total_jit_time - for sum of functions+inlining+optimization+emission time
> > min_jit_time - for sum of functions+inlining+optimization+emission time
> > max_jit_time - for sum of functions+inlining+optimization+emission time
> > mean_jit_time - for sum of functions+inlining+optimization+emission time
> > stddev_jit_time - for sum of functions+inlining+optimization+emission
> time
> > jit_functions
> > jit_generation_time
> > jit_inlining_count
> > jit_inlining_time
> > jit_optimization_count
> > jit_optimization_time
> > jit_emission_count
> > jit_emission_time
> >
> > That is, we'd get the more detailed timings across the total time, but
> > not on the details. But that might be overkill.
>
> I also thought about it but it seems overkill.  pg_stat_statements view is
> already very big, and I think that the JIT time should be somewhat stable,
> at
> least compared to how much a query execution time can vary depending on the
> parameters.  This approach would also be a bit useless if you change the
> costing of underlying JIT operation.
>
> > But -- here's an updated patched based on Option 2.
>
> Thanks!
>
> Code-wide, the patch looks good.  For the doc, it seems that you documented
> jit_inlining_count three times rather than documenting
> jit_optimization_count
> and jit_emission_count.
>

Oops, thanks and fixed.


I don't think we can add tests there, and having a test for every new
> counter
> being >= 0 seems entirely useless, however there should be a new test
> added for
> the "oldextversions" test to make sure that there's no issue with old SQL
> / new
> shlib compatibility.  And looking at it I see that it was already missed
> for
> version 1.9 :(
>

Indeed. Fixed here.

Michael had already applied a patch that took us to 1.10 and added that
test, so I've just updated it here. I don't think we normally bump the
version twice in the same day, so I just merged the SQL script changes as
well.

PFA a "final" version for the CI to run.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index 2813eb16cb..efb2049ecf 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -197,44 +197,52 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
 -- New functions and views for pg_stat_statements in 1.10
 AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
 \d pg_stat_statements
-View "public.pg_stat_statements"
-   Column|   Type   | Collation | Nullable | Default 
--+--+---+--+-
- userid  | oid  |   |  | 
- dbid| oid  |   |  | 
- toplevel| boolean  |   |  | 
- queryid | bigint   |   |  | 
- query   | text |   |  | 
- plans   | bigint   |   |  | 
- total_plan_time | double precision |   |  | 
- min_plan_time   | double precision |   |  | 
- max_plan_time   | double precision |   |  | 
- mean_plan_time  | double precision |   |  | 
- stddev_plan_time| double precision |   |  | 
- calls   | bigint   |   |  | 
- total_exec_time | double precision |   |  | 
- min_exec_time   | double precision |   |  | 
- max_exec_time   | double precision |   |  | 
- mean_exec_time  | double precision |   |  | 
- stddev_exec_time| double precision |   |  | 
- rows| bigint   |   |  | 
- shared_blks_hit | bigint   |   |  | 
- shared_blks_read| bigint   |   |  | 
- shared_blks_dirtied | bigint   |   |  | 
- shared_blks_written | bigint   |   |  | 
- local_blks_hit  | bigint   |   |  | 
- local_blks_read | bigint   |   |  | 
- local_blks_dirtied  | bigint   |   |  | 
- local_blks_written  | bigint   |   |  | 
- temp_blks_read  | bigint   |   |  | 
- temp_blks_written   | bigint   |   |  | 
- 

Re: generic plans and "initial" pruning

2022-04-08 Thread Amit Langote
Hi David,

On Fri, Apr 8, 2022 at 8:16 PM David Rowley  wrote:
> On Fri, 8 Apr 2022 at 17:49, Amit Langote  wrote:
> > Attached updated patch with these changes.
> Thanks for making the changes.  I started looking over this patch but
> really feel like it needs quite a few more iterations of what we've
> just been doing to get it into proper committable shape. There seems
> to be only about 40 mins to go before the freeze, so it seems very
> unrealistic that it could be made to work.

Yeah, totally understandable.

> I started trying to take a serious look at it this evening, but I feel
> like I just failed to get into it deep enough to make any meaningful
> improvements.  I'd need more time to study the problem before I could
> build up a proper opinion on how exactly I think it should work.
>
> Anyway. I've attached a small patch that's just a few things I
> adjusted or questions while reading over your v13 patch.  Some of
> these are just me questioning your code (See XXX comments) and some I
> think are improvements. Feel free to take the hunks that you see fit
> and drop anything you don't.

Thanks a lot for compiling those.

Most of the changes looked fine to me except for a couple of typos, so I've
adopted them into the attached new version, even though I know it's
too late to try to apply it.  Re the XXX comments:

+ /* XXX why would pprune->rti_map[i] ever be zero here??? */

Yeah, no, there can't be; I was perhaps being overly paranoid.

+ * XXX is it worth doing a bms_copy() on glob->minLockRelids if
+ * glob->containsInitialPruning is true?. I'm slighly worried that the
+ * Bitmapset could have a very long empty tail resulting in excessive
+ * looping during AcquireExecutorLocks().
+ */

I guess I trust your instincts about bitmapset operation efficiency
and what you've written here makes sense.  It's typical for leaf
partitions to have been appended toward the tail end of rtable and I'd
imagine their indexes would be in the tail words of minLockRelids.  If
copying the bitmapset removes those useless words, I don't see why we
shouldn't do that.  So added:

+ /*
+ * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
+ * bits from it just above to prevent empty tail bits resulting in
+ * inefficient looping during AcquireExecutorLocks().
+ */
+ if (glob->containsInitialPruning)
+ glob->minLockRelids = bms_copy(glob->minLockRelids);

Not 100% sure about the comment I wrote.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v14-0001-Optimize-AcquireExecutorLocks-to-skip-pruned-par.patch
Description: Binary data


Re: Extract epoch from Interval weird behavior

2022-04-08 Thread Peter Eisentraut

On 24.02.22 03:35, Joseph Koshakow wrote:

However when executing EXTRACT we first truncate
DAYS_PER_YEAR to an integer, and then multiply it
by the total years in the Interval
/* this always fits into int64 */

secs_from_day_month = ((int64) DAYS_PER_YEAR * (interval->month / MONTHS_PER_YEAR) +
                       (int64) DAYS_PER_MONTH * (interval->month % MONTHS_PER_YEAR) +
                       interval->day) * SECS_PER_DAY;

Is this truncation on purpose? It seems like
EXTRACT is not accounting for leap years in
its calculation.


This was not intentional.  The cast is only to make the multiplication 
happen in int64; it didn't mean to drop any fractional parts.



Oops, I sent that to the wrong email. If this isn't intended, I've created a patch
that fixes it, with the following two open questions:
  * DAYS_PER_YEAR_NUM is recalculated every time. Is there any way
to convert a float directly to a numeric to avoid this?


We really wanted to avoid doing calculations in numeric as much as
possible.  So we should figure out a different way to write this.  The
attached patch works for me.  It's a bit ugly since it hardcodes some
factors.  (The factor of 4 makes 4 * DAYS_PER_YEAR, i.e. 1461, an exact
integer, so the whole computation stays in int64; dividing SECS_PER_DAY by 4
at the end restores the scale.)  Maybe we can rephrase it a bit more
elegantly.

From 0b7222beb5260c710d79c9a0573c3b39a64acf1b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 8 Apr 2022 13:16:25 +0200
Subject: [PATCH v1] Fix extract epoch from interval calculation

The new numeric code for extract epoch from interval accidentally
truncated the DAYS_PER_YEAR value to an integer, leading to results
that mismatched the floating-point interval_part calculations.

The commit a2da77cdb4661826482ebf2ddba1f953bc74afe4 that introduced
this actually contains the regression test change that this reverts.
I suppose this was missed at the time.

Reported-by: Joseph Koshakow 
Discussion: 
https://www.postgresql.org/message-id/flat/CAAvxfHd5n%3D13NYA2q_tUq%3D3%3DSuWU-CufmTf-Ozj%3DfrEgt7pXwQ%40mail.gmail.com
---
 src/backend/utils/adt/timestamp.c  | 6 +++---
 src/test/regress/expected/interval.out | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 1c0bf0aa5c..2677d27632 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5288,9 +5288,9 @@ interval_part_common(PG_FUNCTION_ARGS, bool retnumeric)
int64   val;
 
/* this always fits into int64 */
-        secs_from_day_month = ((int64) DAYS_PER_YEAR * (interval->month / MONTHS_PER_YEAR) +
-                               (int64) DAYS_PER_MONTH * (interval->month % MONTHS_PER_YEAR) +
-                               interval->day) * SECS_PER_DAY;
+        secs_from_day_month = ((int64) (4 * DAYS_PER_YEAR) * (interval->month / MONTHS_PER_YEAR) +
+                               (int64) (4 * DAYS_PER_MONTH) * (interval->month % MONTHS_PER_YEAR) +
+                               (int64) 4 * interval->day) * SECS_PER_DAY/4;
 
/*---
 * result = secs_from_day_month + interval->time / 
1'000'000
diff --git a/src/test/regress/expected/interval.out 
b/src/test/regress/expected/interval.out
index e4b1246f45..8e2d535543 100644
--- a/src/test/regress/expected/interval.out
+++ b/src/test/regress/expected/interval.out
@@ -953,11 +953,11 @@ SELECT f1,
  @ 1 min   |   0 |   0.000 |   0.00 |  
1 |0 |   0 | 0 |   1 |0 |  0 |   0 |  0 |   
  60.00
  @ 5 hours |   0 |   0.000 |   0.00 |  
0 |5 |   0 | 0 |   1 |0 |  0 |   0 |  0 |  
18000.00
  @ 10 days |   0 |   0.000 |   0.00 |  
0 |0 |  10 | 0 |   1 |0 |  0 |   0 |  0 | 
864000.00
- @ 34 years|   0 |   0.000 |   0.00 |  
0 |0 |   0 | 0 |   1 |   34 |  3 |   0 |  0 | 
1072224000.00
+ @ 34 years|   0 |   0.000 |   0.00 |  
0 |0 |   0 | 0 |   1 |   34 |  3 |   0 |  0 | 
1072958400.00
  @ 3 mons  |   0 |   0.000 |   0.00 |  
0 |0 |   0 | 3 |   2 |0 |  0 |   0 |  0 |
7776000.00
  @ 14 secs ago |   -1400 |  -14000.000 | -14.00 |  
0 |0 |   0 | 0 |   1 |0 |  0 |   0 |  0 |   
 -14.00
  @ 1 day 2 hours 3 mins 4 secs | 400 |4000.000 |   4.00 |  
3 |2 |   1 | 0 |   1 |0 |  0 |   0 |  0 |  
93784.00
- @ 6 years | 

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Matthias van de Meent
On Fri, 8 Apr 2022 at 01:01, Andres Freund  wrote:
>
> Hi,
>
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
> several KB of stack at the current the current value already (*), but if it 
> grows
> further...

Yeah, I think we should definitely support more line pointers on a
heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
the current value is the physical limit for heap tuples, as we have at
most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage
won't change. A macro MaxHeapLinePointersPerPage would probably be
more useful, which could be as follows (assuming we don't want to
allow filling a page with effectively only dead line pointers):

#define MaxHeapLinePointersPerPage \
   ((int) (((BLCKSZ - SizeOfPageHeaderData) / \
 (MAXALIGN(SizeofHeapTupleHeader) + 2 * sizeof(ItemIdData))) * 2))

This accounts for the worst case of one redirect + one min-sized live
heap tuple, and fills the page with it. Although impossible to put a
page in such a state, that would be the worst case of live line
pointers on a page.
For the default BLCKSZ of 8kB, that results in 510 line pointers
used-but-not-dead, an increase of ~ 70% over what's currently
available.
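
For anyone wanting to check the arithmetic, the figures can be reproduced
with the usual 8kB-page constants (BLCKSZ = 8192, SizeOfPageHeaderData = 24,
MAXALIGN(SizeofHeapTupleHeader) = 24, sizeof(ItemIdData) = 4 -- plugged in
by hand here, so purely illustrative); integer division mirrors the macros:

    SELECT (8192 - 24) / (24 + 4)         AS max_heap_tuples_per_page,        -- 291
           (8192 - 24) / (24 + 2 * 4) * 2 AS max_heap_line_pointers_per_page; -- 510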

-Matthias




Re: Support logical replication of DDLs

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-08, Amit Kapila wrote:

> On Thu, Mar 17, 2022 at 3:36 AM Alvaro Herrera  
> wrote:
> >
> > Did you see some old code I wrote towards this goal?
> > https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
> > The intention was that DDL would produce some JSON blob that accurately
> > describes the DDL that was run;
> 
> I have read that thread and found one of your emails [1] where you
> seem to be saying that JSON representation is not required for BDR.
> Will in some way going via JSON blob way make this project
> easier/better?

I don't know if replication support will be easier by using JSON; I just
think that JSON makes the overall feature more easily usable for other
purposes.

I am not familiar with BDR replication nowadays.

> For runtime conditions, one of the things you have mentioned in that
> thread is to add schema name in the statement at the required places
> which this patch deals with in a different way by explicitly sending
> it along with the DDL statement.

Hmm, ok.  The point of the JSON-blob route is that the publisher sends a
command representation that can be parsed/processed/transformed
arbitrarily by the subscriber using generic rules; it should be trivial
to use a JSON tool to change schema A to schema B in any arbitrary DDL
command, and produce another working DDL command without having to know
how to write that command specifically.  So if I have a rule that
"schema A there is schema B here", all DDL commands can be replayed with
no further coding (without having to rely on getting the run-time
search_path correct.)
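
As a toy illustration of that idea (the blob here is invented for the
example; the real deparsed format is richer), a generic rule can rewrite the
schema without understanding the specific command:

    SELECT jsonb_set(
             '{"command": "CREATE TABLE", "schema": "a", "name": "t1",
               "columns": [{"name": "id", "type": "int"}]}'::jsonb,
             '{schema}', to_jsonb('b'::text));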

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El sudor es la mejor cura para un pensamiento enfermo" (Bardia)




Re: Last day of commitfest

2022-04-08 Thread Magnus Hagander
On Thu, Apr 7, 2022 at 3:59 AM Julien Rouhaud  wrote:

>
> > * JIT counters in pg_stat_statements (Magnus Hagander)
> >  Feedback from Dmitry Dolgov and Julien Rouhaud
>
> Note that the code looks good and no one disagreed with the proposed
> fields.
>
> The only remaining problem is a copy/pasto in the docs so nothing
> critical.  I
> personally think that it would be very good to have so maybe Magnus will
> push
> it today (which would probably instantly break the other pg_stat_statements
> patches that are now Ready for Committer), and if not I think it should go
> to
> the next commitfest instead.
>

Dang, I missed that one while looking at the other jit patch.

It did already conflict with the patch that Michael applied, but I'll try
to clean that up quickly and apply it with this fix.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Stephen Frost
Greetings,

On Fri, Apr 8, 2022 at 07:27 Magnus Hagander  wrote:

> On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> wrote:
>
>> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
>> wrote:
>>
>>> On Wed, 30 Mar 2022 at 02:38, Robert Haas  wrote:
>>> > I think WARNING is fine. After all, the parameter is called
>>> > "jit_warn_above_fraction".
>>>
>>> I had a think about this patch.  I guess it's a little similar to
>>> checkpoint_warning. The good thing about the checkpoint_warning is
>>> that in the LOG message we give instructions about how the DBA can fix
>>> the issue, i.e increase max_wal_size.
>>>
>>> With the proposed patch I see there is no hint about what might be
>>> done to remove/reduce the warnings.  I imagine that's because it's not
>>> all that clear which GUC should be changed. In my view, likely
>>> jit_above_cost is the most relevant but there is also
>>> jit_inline_above_cost, jit_optimize_above_cost, jit_tuple_deforming
>>> and jit_expressions which are relevant too.
>>>
>>> If we go with this patch,  the problem I see here is that the amount
>>> of work the JIT compiler must do for a given query depends mostly on
>>> the number of expressions that must be compiled in the query (also to
>>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
>>> jit_tuple_deforming and jit_expressions). The DBA does not really have
>>> much control over the number of expressions in the query.  All he or
>>> she can do to get rid of the warning is something like increase
>>> jit_above_cost.  After a few iterations of that, the end result is
>>> that jit_above_cost is now high enough that JIT no longer triggers
>>> for, say, that query to that table with 1000 partitions where no
>>> plan-time pruning takes place.  Is that really a good thing? It likely
>>> means that we just rarely JIT anything at all!
>>>
>>
>> I don't agree with the conclusion of that.
>>
>> What the parameter would be useful for is to be able to tune those costs
>> (or just turn it off) *for that individual query*. That doesn't mean you
>> "rarely JIT anything atll", it just means you rarely JIT that particular
>> query.
>>
>> In fact, my goal is to specifically make people do that and *not* just
>> turn off JIT globally.
>>
>>
>> I'd much rather see us address the costing problem before adding some
>>> warning, especially a warning where it's not clear how to make go
>>> away.
>>>
>>
>> The easiest way would be to add a HINT that says turn off jit for this
>> particular query or something?
>>
>> I do agree that if we can make  "spending too much time on JIT vs query
>> runtime" go away completely, then there is no need for a parameter like
>> this.
>>
>> I still think the warning is useful. And I think it may stay useful even
>> after we have made the JIT costing smarter -- though that's not certain of
>> course.
>>
>>
> This patch is still sitting at "ready for committer".
>
> As an example, I have added such a hint in the attached.
>
> I still stand  by that this patch is better than nothing. Sure, I would
> love for us to adapt the JIT costing model and algorithm to make this not a
> problem. And once we've done that, we should remove the parameter again.
>
> It's not on by default, and it's trivial to remove in the future.
>
>
> Yes, we're right up at the deadline. I'd still like to get it in, so I'd
> really appreciate some further voices :)
>

Looks reasonable to me, so +1. The default has it off, so I seriously
doubt it’ll cause any issues, and it’ll be very handy on large and busy
systems with lots of queries for finding those that have a serious amount of
time being spent in JIT (and hopefully avoid folks just turning jit off
across the board, since that’s worse -- we need more data on jit and need to
work on improving it, not ending up with everyone turning it off).

Thanks!

Stephen

>

