Re: Large Pages and Super Pages for PostgreSQL

2022-01-15 Thread Thomas Munro
On Sun, Jan 16, 2022 at 6:03 PM DEVOPS_WwIT  wrote:
> Solaris and FreeBSD support large/super pages, which can be used
> automatically by applications.
>
> It seems Postgres can't use large/super pages on Solaris and FreeBSD
> (and I think it can't use them on HP-UX and AIX either). Could anyone
> take a look?

Hello,

I can provide some clues and partial answers about page size on three
of the OSes you mentioned:

1.  Solaris:  I haven't used that OS for a long time, but I thought it
was supposed to promote memory to larger pages sizes transparently
with some heuristics.  To control page size explicitly, it *looks*
like memcntl(2) with command MHA_MAPSIZE_VA could be used; that's what
the man page says, anyway.  If someone is interested in writing a
patch to do that, I'd be happy to review it and test it on illumos...

2.  AIX:  We *nearly* made this work recently[1].  The summary is that
AIX doesn't have a way to control the page size of anonymous shared
mmap memory (our usual source of shared memory), so you have to use
SystemV shared memory if you want non-default page size for shared
memory.  We got as far as adding the option shared_memory_type=sysv,
and the next step is pretty easy: just pass in some magic flags.  This
just needs someone with access and motivation to pick up that work...

3.  FreeBSD: FreeBSD does transparently migrate PostgreSQL memory to
"super" pages quite well in my experience, but there is also a new
facility in FreeBSD 13 to ask for specific page sizes explicitly.  I
wrote a quick and dirty patch to enable PostgreSQL's huge_pages and
huge_page_size settings to work with that interface, but I haven't yet
got as far as testing it very hard or proposing it...  but here it is,
if you like experimental code[2].

I don't know about HP-UX.  I think it might be dead, Jim.

[1] 
https://www.postgresql.org/message-id/flat/HE1PR0202MB28126DB4E0B6621CC6A1A91286D90%40HE1PR0202MB2812.eurprd02.prod.outlook.com
[2] 
https://github.com/macdice/postgres/commit/a71aafe5582c2e61005af0d16ca82eed89445a67
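
For illustration only, here is a minimal sketch (not the patch in [2],
which uses the newer explicit page-size interface) of asking FreeBSD for
superpage-aligned anonymous shared memory with the long-standing
MAP_ALIGNED_SUPER flag, which merely helps the transparent promotion along:

/*
 * Minimal sketch, FreeBSD only: request superpage alignment for an
 * anonymous shared mapping so the kernel's transparent promotion has an
 * easier job.  This is not the explicit FreeBSD 13 page-size facility.
 */
#include <sys/mman.h>
#include <stdio.h>

int
main(void)
{
	size_t		sz = 64UL * 1024 * 1024;	/* 64 MB segment */
	char	   *p;

	p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_ANON | MAP_SHARED | MAP_ALIGNED_SUPER, -1, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* Touch every page so promotion to superpages can happen. */
	for (size_t i = 0; i < sz; i += 4096)
		p[i] = 1;

	printf("mapped %zu bytes at %p\n", sz, (void *) p);
	return 0;
}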




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-15 Thread Noah Misch
On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > > Noah Misch  writes:
> > > > Each of the three failures happened on a sparc64 Debian+gcc machine.  I 
> > > > had
> > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > > without reproducing this.
> > 
> > > #   'pgbench: error: client 0 script 1 aborted in command 
> > > 4 query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
> > > unexpected pageaddr 0/0 in log segment 00010001, offset 
> > > 5890048
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-17%2013%3A01%3A24
> > 
> > Two others:
> > ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid 
> > record length at 0/16F1850: wanted 24, got 0
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-12%2013%3A01%3A15
> > ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
> > resource manager data checksum in record at 0/1668020
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake=2021-11-16%2015%3A00%3A52

> > I don't have a great theory, but here are candidates to examine next:
> > 
> > - Run with wal_debug=on to confirm logged write location matches read 
> > location.
> > - Run 
> > "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> >   pg_waldump -s 0/0100" at the end of the test.
> > - Dump WAL page binary image at the point of failure.
> > - Log which branches in XLogReadRecord() are taken.
> 
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida?  I'd use it to attempt the above things.  All else being equal,
> kittiwake is more relevant since it's still supported upstream.

Thanks for setting up access.  Summary: this kernel has a bug in I/O syscalls.
How practical is it to update that kernel?  (Userland differs across these
animals, but the kernel does not.)  The kernel on buildfarm member thorntail
doesn't exhibit the bug.

For specifics of the kernel bug, see the attached test program.  In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same.  POSIX
says the reader should see the data as it existed before the write or the
newly-written data.  On this kernel, the reader can see zeros instead.  That
leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure.  We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.

Separately, though I don't know of any relevance to PostgreSQL, I was interested
to see that CentOS 7 pwrite()/pread() fail to provide the POSIX-required atomicity.

> The setup of your buildfarm animals provides a clue.  I understand kittiwake
> is the same as ibisbill except for build options, and tadarida is the same as
> mussurana except for build options.  ibisbill and mussurana haven't failed, so
> one of these is likely needed to reproduce:
> 
>   absence of --enable-cassert
>   CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
>   CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
>   LDFLAGS='-Wl,-z,relro -Wl,-z,now'

That was a red herring.  ibisbill and mussurana don't use --with-tap-tests.
Adding --with-tap-tests has been enough to make their configurations witness
the same kinds of failures.

nm
/*
 * Stress-test pread(), pwrite(), read(), and write() to detect a few problems
 * with their handling of regular files:
 *
 * - Lack of atomicity.
 *   https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
 *   requires atomicity.  (An implementation may always return <= 1; if it
 *   chooses to return higher values, it must maintain atomicity.)
 *
 * - Transiently making readers see zeros when they read concurrently with a
 *   write, even if the file had no zero at that offset before or after the
 *   write.
 *
 * By default, this program will print "mismatch" messages if pwrite() is
 * non-atomic.  Build with other options to test other behaviors:
 * -DCHANGE_CONTENT=0 tests the zeros bug instead of plain atomicity
 * -DUSE_SEEK=1 tests write()/read() instead of pwrite()/pread()
 * -DREOPEN=0 reuses the same file descriptor across iterations
 * -DXLOG_BLCKSZ=32 tests a different byte count
 *high values may require "ulimit -Ss" changes
 *
 *
 * Observed behaviors:
 *
 * Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK 
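
Purely as a sketch of the technique described above (and not the attached
program itself), the core of such a stress test can be as small as this:
one process keeps pwrite()ing a block of non-zero bytes while a forked
child pread()s the same block and reports any zero byte it observes,
which POSIX says it should never see:

/* Hypothetical sketch, not the attached test program. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	const char *path = "pwrite_vs_pread.tmp";
	char		buf[BLCKSZ];
	int			fd;
	pid_t		pid;

	memset(buf, 'x', sizeof(buf));
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf))
	{
		perror("setup");
		return 1;
	}

	pid = fork();
	if (pid == 0)
	{
		/* Reader: must only ever see 'x' or 'y' bytes, never zeros. */
		char		rbuf[BLCKSZ];

		for (int i = 0; i < 100000; i++)
		{
			ssize_t		n = pread(fd, rbuf, sizeof(rbuf), 0);

			for (ssize_t j = 0; j < n; j++)
				if (rbuf[j] == 0)
				{
					fprintf(stderr, "unexpected zero byte at offset %zd\n", j);
					_exit(1);
				}
		}
		_exit(0);
	}

	/* Writer: keep rewriting the same block with non-zero data. */
	memset(buf, 'y', sizeof(buf));
	for (int i = 0; i < 100000; i++)
		(void) pwrite(fd, buf, sizeof(buf), 0);

	waitpid(pid, NULL, 0);
	unlink(path);
	return 0;
}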

Re: MultiXact/SLRU buffers configuration

2022-01-15 Thread Andrey Borodin


> On 15 Jan 2022, at 20:46, Justin Pryzby wrote:
>>> 
>>> I was planning on running a set of stress tests on these patches. Could 
>>> we confirm which ones we plan to include in the commitfest?
>> 
>> Many thanks for your interest. Here's the latest version.
> 
> This is failing to compile under linux and windows due to bitfield syntax.
> http://cfbot.cputube.org/andrey-borodin.html

Uh, sorry, I formatted a patch from the wrong branch.

Just tested Cirrus. It's wonderful, thanks! Really faster than doing stuff on 
my machines...

Best regards, Andrey Borodin.


v20-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


v20-0002-Divide-SLRU-buffers-into-8-associative-banks.patch
Description: Binary data


v20-0003-Pack-SLRU-page_number-page_status-and-page_dirty.patch
Description: Binary data


Large Pages and Super Pages for PostgreSQL

2022-01-15 Thread DEVOPS_WwIT

Hi Hackers,

Solaris and FreeBSD support large/super pages, which can be used 
automatically by applications.


It seems Postgres can't use large/super pages on Solaris and FreeBSD 
(and I think it can't use them on HP-UX and AIX either). Could anyone 
take a look?


The following is my testing:


1. Check the OS-supported large page sizes

-bash-4.3$ pagesize -a
4096
2097152
1073741824


2. The OS version is 5.11

-bash-4.3$ uname -a
SunOS 08a6a65f-b5a0-c159-f184-e81c379d1f5d 5.11 hunghu-20220114T101258Z:a3282be5a8 i86pc i386 i86pc

-bash-4.3$


3. PostgreSQL shared_buffers is 11GB

-bash-4.3$ grep -i shared_buffer postgresql.conf
shared_buffers = 11GB            # min 128kB


4. Checked on Solaris: all of PostgreSQL's memory is in 4K pages; the 
2M and 1G page sizes are not used


-bash-4.3$ cat postmaster.pid |head -n 1
31637
-bash-4.3$ pmap -sxa 31637
31637:    /opt/local/bin/postgres
 Address  Kbytes     RSS   Anon Locked  Pgsz Mode   Mapped File
0040           4       4      -      -    4K r-x--  postgres
00401000     872      28      -      -     - r-x--  postgres
004DB000      84      84      -      -    4K r-x--  postgres
004F         184      24      -      -     - r-x--  postgres
0051E000     248     248      -      -    4K r-x--  postgres
0055C000       8       8      -      -     - r-x--  postgres
0055E000       8       8      -      -    4K r-x--  postgres
0056          16      12      -      -     - r-x--  postgres
00564000       4       4      -      -    4K r-x--  postgres
00565000      20      20      -      -     - r-x--  postgres
0056A000       4       4      -      -    4K r-x--  postgres
0056B000      24      24      -      -     - r-x--  postgres
00571000       8       8      -      -    4K r-x--  postgres
00573000       4       4      -      -     - r-x--  postgres
00574000      16      16      -      -    4K r-x--  postgres
00578000      24       4      -      -     - r-x--  postgres
0057E000       4       4      -      -    4K r-x--  postgres
0057F000       8       8      -      -     - r-x--  postgres
00581000       8       8      -      -    4K r-x--  postgres
00583000       4       4      -      -     - r-x--  postgres
00584000     188     188      -      -    4K r-x--  postgres
005B3000      84      28      -      -     - r-x--  postgres
005C8000      24      24      -      -    4K r-x--  postgres
005CE000      76      40      -      -     - r-x--  postgres
005E1000       4       4      -      -    4K r-x--  postgres
005E2000     368     280      -      -     - r-x--  postgres
0063E000       4       4      -      -    4K r-x--  postgres
0063F000      80      36      -      -     - r-x--  postgres
00653000      12      12      -      -    4K r-x--  postgres
00656000       8       8      -      -     - r-x--  postgres
00658000       4       4      -      -    4K r-x--  postgres
00659000      12      12      -      -     - r-x--  postgres
0065C000       8       8      -      -    4K r-x--  postgres
0065E000       4       4      -      -     - r-x--  postgres
0065F000       4       4      -      -    4K r-x--  postgres
0066          12      12      -      -     - r-x--  postgres
00663000       8       8      -      -    4K r-x--  postgres
00665000      12      12      -      -     - r-x--  postgres
00668000       4       4      -      -    4K r-x--  postgres
00669000       4       4      -      -     - r-x--  postgres
0066A000       8       8      -      -    4K r-x--  postgres
0066C000      32      32      -      -     - r-x--  postgres
00674000       4       4      -      -    4K r-x--  postgres
00675000       4       4      -      -     - r-x--  postgres
00676000       4       4      -      -    4K r-x--  postgres
00677000     156     156      -      -     - r-x--  postgres
0069E000       4       4      -      -    4K r-x--  postgres
0069F000     416     396      -      -     - r-x--  postgres
00707000       4

Re: using an end-of-recovery record in all cases

2022-01-15 Thread Julien Rouhaud
Hi,

On Mon, Oct 18, 2021 at 10:56:53AM +0530, Amul Sul wrote:
> 
> Attached is the rebased and updated version. The patch removes the
> newly introduced PerformRecoveryXLogAction() function. In addition to
> that, removed the CHECKPOINT_END_OF_RECOVERY flag and its related
> code.  Also, dropped changes for bgwriter.c and slot.c in this patch, which
> seem unrelated to this work.

The cfbot reports that this version of the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3365.log
=== Applying patches on top of PostgreSQL commit ID 
0c53a6658e47217ad3dd416a5543fc87c3ecfd58 ===
=== applying patch 
./v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
patching file src/backend/access/transam/xlog.c
[...]
Hunk #14 FAILED at 9061.
Hunk #15 FAILED at 9241.
2 out of 15 hunks FAILED -- saving rejects to file 
src/backend/access/transam/xlog.c.rej

Can you send a rebased version?  In the meantime I will switch the cf entry to
Waiting on Author.




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2022-01-15 Thread Julien Rouhaud
Hi,

On Fri, Dec 24, 2021 at 07:21:59PM +0900, Kyotaro Horiguchi wrote:
> Just a complaint..
> 
> At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > "" on Japanese (CP-932) environment. I didn't actually
> > tested it on Windows and msys environment ...yet.
> 
> Active perl cannot be installed because of (perhaps) a powershell
> version issue... Annoying..
> 
> https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897

I'm not very familiar with windows, but maybe using strawberry perl instead
([1]) would fix your problem?  I think it's also quite popular and is commonly
used to run pgBadger on Windows.

Other than that, I see that the TAP tests are failing in all the environments
due to Perl errors.  For instance:

[04:06:00.848] [04:05:54] t/003_promote.pl .
[04:06:00.848] Dubious, test returned 2 (wstat 512, 0x200)
https://api.cirrus-ci.com/v1/artifact/task/4751213722861568/tap/src/bin/pg_basebackup/tmp_check/log/regress_log_020_pg_receivewal
# Initializing node "standby" from backup "my_backup" of node "primary"
Odd number of elements in hash assignment at 
/tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
 line 996.
Use of uninitialized value in list assignment at 
/tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
 line 996.
Use of uninitialized value $tsp in concatenation (.) or string at 
/tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
 line 1008.
Use of uninitialized value $tsp in concatenation (.) or string at 
/tmp/cirrus-ci-build/src/bin/pg_ctl/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
 line 1009.

That's apparently the same problem on every failure reported.

Can you send a fixed patchset?  In the meantime I will switch the cf entry to
Waiting on Author.


[1] https://strawberryperl.com/




Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit

2022-01-15 Thread Julien Rouhaud
Hi,

On Thu, Jan 06, 2022 at 05:29:23PM +0900, Etsuro Fujita wrote:
> 
> Done.  Attached is a new version.

The patchset doesn't apply anymore:
http://cfbot.cputube.org/patch_36_3392.log
=== Applying patches on top of PostgreSQL commit ID 
43c2175121c829c8591fc5117b725f1f22bfb670 ===
=== applying patch ./v2-0001-postgres-fdw-Add-support-for-parallel-commit.patch
patching file contrib/postgres_fdw/connection.c
patching file contrib/postgres_fdw/expected/postgres_fdw.out
Hunk #2 FAILED at 10825.
1 out of 2 hunks FAILED -- saving rejects to file 
contrib/postgres_fdw/expected/postgres_fdw.out.rej
patching file contrib/postgres_fdw/option.c
patching file contrib/postgres_fdw/sql/postgres_fdw.sql
Hunk #1 FAILED at 3452.
1 out of 1 hunk FAILED -- saving rejects to file 
contrib/postgres_fdw/sql/postgres_fdw.sql.rej
patching file doc/src/sgml/postgres-fdw.sgml

I also see that Fuji-san raised some questions, so for now I will simply change
the patch status to Waiting on Author.




Re: Isn't wait_for_catchup slightly broken?

2022-01-15 Thread Tom Lane
I wrote:
> Another thing that is bothering me a bit is that a number of the
> callers use $node->lsn('insert') as the target.  This also seems
> rather dubious, because that could be ahead of what's been written
> out.  These callers are just taking it on faith that something will
> eventually cause that extra WAL to get written out (and become
> available to the standby).  Again, that seems to make the test
> slower than it need be, with a worst-case scenario being that it
> eventually times out.  Admittedly this is unlikely to be a big
> problem unless some background op issues an abortive transaction
> at just the wrong time.  Nonetheless, I wonder if we shouldn't
> standardize on "thou shalt use the write position", because I
> don't think the other alternatives have anything to recommend them.

Here's a version that makes sure that callers specify a write position not
an insert position.  I also simplified the callers wherever it turned
out that they could just use the default parameters.

regards, tom lane

diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
index b616c0ccf1..98dbdab595 100644
--- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl
+++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
@@ -290,7 +290,7 @@ $standby->psql(
 	"CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)",
 	replication => 1);
 # Wait for standby catchup
-$primary->wait_for_catchup($standby, 'replay', $primary->lsn('write'));
+$primary->wait_for_catchup($standby);
 # Get a walfilename from before the promotion to make sure it is archived
 # after promotion
 my $standby_slot = $standby->slot($archive_slot);
diff --git a/src/bin/pg_rewind/t/007_standby_source.pl b/src/bin/pg_rewind/t/007_standby_source.pl
index 62254344f6..239b510c1a 100644
--- a/src/bin/pg_rewind/t/007_standby_source.pl
+++ b/src/bin/pg_rewind/t/007_standby_source.pl
@@ -74,7 +74,7 @@ $node_a->safe_psql('postgres',
 	"INSERT INTO tbl1 values ('in A, before promotion')");
 $node_a->safe_psql('postgres', 'CHECKPOINT');
 
-my $lsn = $node_a->lsn('insert');
+my $lsn = $node_a->lsn('write');
 $node_a->wait_for_catchup('node_b', 'write', $lsn);
 $node_b->wait_for_catchup('node_c', 'write', $lsn);
 
@@ -93,8 +93,7 @@ $node_a->safe_psql('postgres',
 	"INSERT INTO tbl1 VALUES ('in A, after C was promoted')");
 
 # make sure it's replicated to B before we continue
-$lsn = $node_a->lsn('insert');
-$node_a->wait_for_catchup('node_b', 'replay', $lsn);
+$node_a->wait_for_catchup('node_b');
 
 # Also insert a new row in the standby, which won't be present in the
 # old primary.
@@ -161,7 +160,7 @@ in A, after C was promoted
 $node_a->safe_psql('postgres',
 	"INSERT INTO tbl1 values ('in A, after rewind')");
 
-$lsn = $node_a->lsn('insert');
+$lsn = $node_a->lsn('write');
 $node_b->wait_for_catchup('node_c', 'replay', $lsn);
 
 check_query(
diff --git a/src/bin/pg_rewind/t/008_min_recovery_point.pl b/src/bin/pg_rewind/t/008_min_recovery_point.pl
index 9ad7e6b62b..8240229230 100644
--- a/src/bin/pg_rewind/t/008_min_recovery_point.pl
+++ b/src/bin/pg_rewind/t/008_min_recovery_point.pl
@@ -69,8 +69,7 @@ $node_3->init_from_backup($node_1, $backup_name, has_streaming => 1);
 $node_3->start;
 
 # Wait until node 3 has connected and caught up
-my $lsn = $node_1->lsn('insert');
-$node_1->wait_for_catchup('node_3', 'replay', $lsn);
+$node_1->wait_for_catchup('node_3');
 
 #
 # Swap the roles of node_1 and node_3, so that node_1 follows node_3.
@@ -106,8 +105,7 @@ $node_2->restart();
 #
 
 # make sure node_1 is full caught up with node_3 first
-$lsn = $node_3->lsn('insert');
-$node_3->wait_for_catchup('node_1', 'replay', $lsn);
+$node_3->wait_for_catchup('node_1');
 
 $node_1->promote;
 # Force a checkpoint after promotion, like earlier.
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index e18f27276c..eceb8b2d8b 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2509,8 +2509,11 @@ poll_query_until timeout.
 
 Requires that the 'postgres' db exists and is accessible.
 
-target_lsn may be any arbitrary lsn, but is typically $primary_node->lsn('insert').
-If omitted, pg_current_wal_lsn() is used.
+The default value of target_lsn is $node->lsn('write'), which ensures
+that the standby has caught up to what has been committed on the primary.
+If you pass an explicit value of target_lsn, it should almost always be
+the primary's write LSN; so this parameter is seldom needed except when
+querying some intermediate replication node rather than the primary.
 
 This is not a test. It die()s on failure.
 
@@ -2531,23 +2534,18 @@ sub wait_for_catchup
 	{
 		$standby_name = $standby_name->name;
 	}
-	my $lsn_expr;
-	if (defined($target_lsn))
+	if (!defined($target_lsn))
 	{
-		$lsn_expr = "'$target_lsn'";
-	}
-	else
-	{
-		$lsn_expr = 'pg_current_wal_lsn()';
+		$target_lsn = $self->lsn('write');
 	}
 	print 

Re: sequences vs. synchronous replication

2022-01-15 Thread Tomas Vondra

On 1/15/22 06:12, Fujii Masao wrote:



On 2022/01/12 1:07, Tomas Vondra wrote:

I explored the idea of using page LSN a bit


Many thanks!



The patch from 22/12 simply checks if the change should/would wait for
sync replica, and if yes it WAL-logs the sequence increment. There's a
couple problems with this, unfortunately:


Yes, you're right.



So I think this approach is not really an improvement over WAL-logging
every increment. But there's a better way, I think - we don't need to
generate WAL, we just need to ensure we wait for it to be flushed at
transaction end in RecordTransactionCommit().

That is, instead of generating more WAL, simply update XactLastRecEnd
and then ensure RecordTransactionCommit flushes/waits etc. Attached is a
patch doing that - the changes in sequence.c are trivial, changes in
RecordTransactionCommit simply ensure we flush/wait even without XID
(this actually raises some additional questions that I'll discuss in a
separate message in this thread).


This approach (and also my previous proposal) seems to assume that the 
value returned from nextval() should not be used until the transaction 
executing that nextval() has been committed successfully. But I'm not 
sure how many applications follow this assumption. Some applications 
might use the return value of nextval() immediately, before issuing the 
commit command. Some might use the return value of nextval() executed in 
a rolled-back transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


If we want to avoid duplicate sequence value even in those cases, ISTM 
that the transaction needs to wait for WAL flush and sync rep before 
nextval() returns the value. Of course, this might cause other issues 
like performance decrease, though.




Right, something like that. But that'd hurt well-behaving applications, 
because by doing the wait earlier (in nextval, not at commit) it 
increases the probability of waiting.


FWIW I'm not against improvements in this direction, but it goes way 
beyond fixing the original issue.





On btrfs, it looks like this (the numbers next to nextval are the cache
size, with 1 being the default):

   client  test          master   log-all  page-lsn   log-all  page-lsn
   ---------------------------------------------------------------------
        1  insert           829       807       802       97%       97%
           nextval/1      16491       814     16465        5%      100%
           nextval/32     24487     16462     24632       67%      101%
           nextval/64     24516     24918     24671      102%      101%
           nextval/128    32337     33178     32863      103%      102%

   client  test          master   log-all  page-lsn   log-all  page-lsn
   ---------------------------------------------------------------------
        4  insert          1577      1590      1546      101%       98%
           nextval/1      45607      1579     21220        3%       47%
           nextval/32     68453     49141     51170       72%       75%
           nextval/64     66928     65534     66408       98%       99%
           nextval/128    83502     81835     82576       98%       99%

The results seem clearly better, I think.


Thanks for benchmarking this! I agree that page-lsn is obviously better 
than log-all.




For "insert" there's no drop at all (same as before), because as soon as
a transaction generates any WAL, it has to flush/wait anyway.

And for "nextval" there's a drop, but only with 4 clients, and it's much
smaller (53% instead of 97%). And increasing the cache size eliminates
even that.


Yes, but a 53% drop would be critical for some applications that don't want 
to increase the cache size for some reason. So IMO it's better to 
provide the option to enable/disable that page-lsn approach.




I disagree. This drop applies only to extremely simple transactions - 
once the transaction does any WAL write, it disappears. Even if the 
transaction does only a couple reads, it'll go away. I find it hard to 
believe there's any serious application doing this.


So I think we should get it reliable (to not lose data after commit) 
first and then maybe see if we can improve this.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: extended stats on partitioned tables

2022-01-15 Thread Tomas Vondra




On 1/15/22 19:35, Tomas Vondra wrote:

On 1/15/22 06:11, Justin Pryzby wrote:

On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote:

1) If the table is a separate relation (not part of an inheritance
tree), this should make no difference. -> OK

2) If the table is using "old" inheritance, this reverts back to
pre-regression behavior. So people will keep using the old statistics
until the ANALYZE, and we need to tell them to ANALYZE or something.

3) If the table is using partitioning, it's guaranteed to be empty and
there are no stats at all. Again, we should tell people to run ANALYZE.


I think these can be mentioned in the commit message, which can end up in the
minor release notes as a recommendation to rerun ANALYZE.



Good point. I pushed the 0002 part and added a short paragraph 
suggesting ANALYZE might be necessary. I did not go into details about 
the individual cases, because that'd be too much for a commit message.



Thanks for pushing 0001.



Thanks for posting the patches!

I've pushed the second part - attached are the two remaining parts. I'll 
wait a bit before pushing the rebased 0001 (which goes into master 
branch only). Not sure about 0002 - I'm not convinced the refactored ACL 
checks are an improvement, but I'll think about it.




BTW when backpatching the first part, I had to decide what to do about 
tests. The 10 & 11 branches did not have the check_estimated_rows() 
function. I considered removing the tests, reworking the tests not to 
need the function, or adding the function. Ultimately I added the 
function, which seemed like the best option.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Windows: Wrong error message at connection termination

2022-01-15 Thread Tom Lane
Sergey Shinderuk  writes:
> On 14.01.2022 13:01, Sergey Shinderuk wrote:
>> Unexpectedly, this changes the error message:
> ...
> For the record, after more poking I realized that it depends on timing. 
>   By injecting delays I can get any of the following from libpq:
> * could not receive data from server: Software caused connection abort
> * server closed the connection unexpectedly
> * no connection to the server

Thanks for the follow-up.  At the moment I'm not planning to do anything
pending the results of the other thread [1].  It seems likely though that
we'll end up reverting this explicit-close behavior in the back branches,
as the other changes involved look too invasive for back-patching.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com




Re: Support tab completion for upper character inputs in psql

2022-01-15 Thread Tom Lane
Peter Eisentraut  writes:
> The rest of the patch seems ok in principle, since AFAICT enums are the 
> only query result in tab-complete.c that are not identifiers and thus 
> subject to case issues.

I spent some time looking at this patch.  I'm not very happy with it,
for two reasons:

1. The downcasing logic in the patch bears very little resemblance
to the backend's actual downcasing logic, which can be found in
src/backend/parser/scansup.c's downcase_identifier().  Notably,
the patch's restriction to only convert all-ASCII strings seems
indefensible, because that's not how things really work.  I fear
we can't always exactly duplicate the backend's behavior, because
it's dependent on the server's locale and encoding; but I think
we should at least get it right in the common case where psql is
using the same locale and encoding as the server.
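
For reference, a rough and much-simplified approximation of that backend
rule (illustration only, written for this discussion; the real
downcase_identifier() also consults the server encoding and locale):

#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Simplified sketch: fold ASCII A-Z unconditionally; fold high-bit-set
 * bytes via tolower() only in single-byte encodings.
 */
char *
downcase_sketch(const char *ident, int len, bool single_byte_encoding)
{
	char	   *result = malloc(len + 1);

	for (int i = 0; i < len; i++)
	{
		unsigned char ch = (unsigned char) ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		else if (single_byte_encoding && ch >= 0x80 && isupper(ch))
			ch = (unsigned char) tolower(ch);
		result[i] = (char) ch;
	}
	result[len] = '\0';
	return result;
}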

2. I don't think there's been much thought about the larger picture
of what is to be accomplished.  Right now, we successfully
tab-complete inputs that are prefixes of the canonical spelling (per
quote_identifier) of the object's name, and don't try at all for
non-canonical spellings.  I'm on board with trying to allow some of
the latter but I'm not sure that this patch represents much forward
progress.  To be definite about it, suppose we have a DB containing
just two tables whose names start with "m", say mytab and mixedTab.
Then:

(a) m<TAB> immediately completes mytab, ignoring mixedTab

(b) "m immediately completes "mixedTab", ignoring mytab

(c) "my fails to find anything

(d) mi<TAB> fails to find anything

(e) M<TAB> fails to find anything

This patch proposes to improve case (e), but to my taste cases (a)
through (c) are much bigger problems.  It'd be nice if (d) worked too
--- that'd require injecting a double-quote where the user had not
typed one, but we already do the equivalent thing with single-quotes
for file names, so why not?  (Although after fighting with readline
yesterday to try to get it to handle single-quoted enum labels sanely,
I'm not 100% sure if (d) is possible.)

Also, even for case (e), what we have with this patch is that it
immediately completes mytab, ignoring mixedTab.  Is that what we want?
Another example is that miX<TAB> fails to find anything, which seems
like a POLA violation given that mY<TAB> completes to mytab.

I'm not certain how many of these alternatives can be supported
without introducing ambiguity that wasn't there before (which'd
manifest as failing to complete in cases where the existing code
chooses an alternative just fine).  But I really don't like the
existing behavior for (b) and (c) --- I should be able to spell
a name with double quotes if I want, without losing completion
support.

BTW, another thing that maybe we should think about is how this
interacts with the pattern matching capability in \d and friends.
If people can tab-complete non-canonical spellings, they might
expect the same spellings to work in \d.  I don't say that this
patch has to fix that, but we might want to look and be sure we're
not painting ourselves into a corner (especially since I see
that we already perform tab-completion in that context).

regards, tom lane




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-15 Thread Justin Pryzby
On Sat, Jan 15, 2022 at 08:39:14PM +0300, Michail Nikolaev wrote:
> Hello, Julien.
> 
> Thanks for your attention.
> 
> > The cfbot reports that this patch is currently failing at least on
> > Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952.
> 
> Fixed. It was an issue with the test - it hangs on Windows because of
> psql + spurious vacuum sometimes.

It looks like there's still a server crash that caused the CI or client to hang.

https://cirrus-ci.com/task/6350310141591552
2022-01-13 06:31:04.182 GMT [8636][walreceiver] FATAL:  could not receive data 
from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2022-01-13 06:31:04.182 GMT [6848][startup] LOG:  invalid record length at 
0/3014B58: wanted 24, got 0
2022-01-13 06:31:04.228 GMT [8304][walreceiver] FATAL:  could not connect to 
the primary server: connection to server on socket 
"C:/Users/ContainerAdministrator/AppData/Local/Temp/_7R9Pa5CwW/.s.PGSQL.58307" 
failed: Connection refused (0x274D/10061)




Re: extended stats on partitioned tables

2022-01-15 Thread Tomas Vondra

On 1/15/22 06:11, Justin Pryzby wrote:

On Mon, Dec 13, 2021 at 09:40:09PM +0100, Tomas Vondra wrote:

1) If the table is a separate relation (not part of an inheritance
tree), this should make no difference. -> OK

2) If the table is using "old" inheritance, this reverts back to
pre-regression behavior. So people will keep using the old statistics
until the ANALYZE, and we need to tell them to ANALYZE or something.

3) If the table is using partitioning, it's guaranteed to be empty and
there are no stats at all. Again, we should tell people to run ANALYZE.


I think these can be mentioned in the commit message, which can end up in the
minor release notes as a recommendation to rerun ANALYZE.



Good point. I pushed the 0002 part and added a short paragraph 
suggesting ANALYZE might be necessary. I did not go into details about 
the individual cases, because that'd be too much for a commit message.



Thanks for pushing 0001.



Thanks for posting the patches!

I've pushed the second part - attached are the two remaining parts. I'll 
wait a bit before pushing the rebased 0001 (which goes into master 
branch only). Not sure about 0002 - I'm not convinced the refactored ACL 
checks are an improvement, but I'll think about it.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From e065e54d1961a4e295ebe77febbe6a3307d142b5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 12 Dec 2021 00:07:27 +0100
Subject: [PATCH 1/2] Add stxdinherit flag to pg_statistic_ext_data

The support for extended statistics on inheritance trees was somewhat
problematic, because the catalog pg_statistic_ext_data did not have a
flag identifying whether the data was built with/without data for the
child relations.

For partitioned tables we've been able to work around this because the
non-leaf relations can't contain data, so there's really just one set of
statistics to store. But for regular inheritance trees we had to pick,
and there is no clearly better option - we need both.

This introduces the pg_statistic_ext_data.stxdinherit flag, analogous to
pg_statistic.stainherit, and modifies analyze to build statistics for
both cases.

This relaxes the relationship between the two catalogs - until now we've
created the pg_statistic_ext_data one when creating the statistics, and
then only updated it later. There was always 1:1 relationship between
rows in the two catalogs. With this change we no longer insert any data
rows while creating statistics, because we don't know which flag value
to create, and it seems wasteful to initialize both for every relation.
There may be 0, 1 or 2 data rows for each statistics definition.

Patch by me, with extensive improvements and fixes by Justin Pryzby.

Author: Tomas Vondra, Justin Pryzby
Reviewed-by: Tomas Vondra, Justin Pryzby
Discussion: https://postgr.es/m/20210923212624.GI831%40telsasoft.com
---
 doc/src/sgml/catalogs.sgml  |  23 +++
 src/backend/catalog/system_views.sql|   2 +
 src/backend/commands/analyze.c  |  28 +---
 src/backend/commands/statscmds.c|  72 +-
 src/backend/optimizer/util/plancat.c| 147 
 src/backend/statistics/dependencies.c   |  22 ++-
 src/backend/statistics/extended_stats.c |  75 +-
 src/backend/statistics/mcv.c|   9 +-
 src/backend/statistics/mvdistinct.c |   5 +-
 src/backend/utils/adt/selfuncs.c|  37 +
 src/backend/utils/cache/syscache.c  |   6 +-
 src/include/catalog/pg_statistic_ext_data.h |   4 +-
 src/include/commands/defrem.h   |   1 +
 src/include/nodes/pathnodes.h   |   1 +
 src/include/statistics/statistics.h |  11 +-
 src/test/regress/expected/rules.out |   2 +
 src/test/regress/expected/stats_ext.out |  31 +++--
 src/test/regress/sql/stats_ext.sql  |  13 +-
 18 files changed, 254 insertions(+), 235 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 03e2537b07d..2aeb2ef346e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7521,6 +7521,19 @@ SCRAM-SHA-256$iteration count:
created with CREATE STATISTICS.
   
 
+  
+   Normally there is one entry, with stxdinherit =
+   false, for each statistics object that has been analyzed.
+   If the table has inheritance children, a second entry with
+   stxdinherit = true is also created.
+   This row represents the statistics object over the inheritance tree, i.e.,
+   statistics for the data you'd see with
+   SELECT * FROM table*,
+   whereas the stxdinherit = false row
+   represents the results of
+   SELECT * FROM ONLY table.
+  
+
   
Like pg_statistic,
pg_statistic_ext_data should not be
@@ -7560,6 +7573,16 @@ SCRAM-SHA-256$iteration count:
   
  
 
+ 
+  
+   stxdinherit bool
+  
+  
+   If true, the stats include inheritance child columns, not just the

Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-15 Thread Michail Nikolaev
Hello, Julien.

Thanks for your attention.

> The cfbot reports that this patch is currently failing at least on
> Linux and Windows, e.g. https://cirrus-ci.com/task/6532060239101952.

Fixed. It was an issue with the test - it hangs on Windows because of
psql + spurious vacuum sometimes.

> I'm switching this patch on Waiting on Author.

I have tested it multiple times on my GitHub repo; it seems to be stable now.
Switching back to Ready for committer.

Best regards.
Michail.
From 9372bac9b56d27cf993e9d1fa66127c86b51f25c Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sat, 15 Jan 2022 16:21:51 +0300
Subject: [PATCH v7 1/3] code

---
 src/backend/access/common/bufmask.c  | 25 
 src/backend/access/gist/gistget.c| 43 +++--
 src/backend/access/gist/gistxlog.c   | 15 +
 src/backend/access/hash/hash.c   |  4 +-
 src/backend/access/hash/hash_xlog.c  | 17 +
 src/backend/access/hash/hashsearch.c | 18 --
 src/backend/access/hash/hashutil.c   | 33 +-
 src/backend/access/heap/heapam.c | 42 +---
 src/backend/access/heap/heapam_handler.c |  5 +-
 src/backend/access/index/genam.c | 20 +++---
 src/backend/access/index/indexam.c   | 81 +---
 src/backend/access/nbtree/nbtinsert.c| 22 +--
 src/backend/access/nbtree/nbtree.c   |  4 +-
 src/backend/access/nbtree/nbtsearch.c| 14 +++-
 src/backend/access/nbtree/nbtutils.c | 33 +-
 src/backend/access/nbtree/nbtxlog.c  | 16 +
 src/backend/access/table/tableam.c   |  4 +-
 src/backend/access/transam/rmgr.c|  4 +-
 src/backend/access/transam/xlogutils.c   |  6 ++
 src/backend/storage/ipc/standby.c|  6 ++
 src/bin/pg_rewind/parsexlog.c|  2 +-
 src/bin/pg_waldump/rmgrdesc.c|  2 +-
 src/include/access/bufmask.h |  1 +
 src/include/access/gist.h|  5 ++
 src/include/access/gistxlog.h|  1 +
 src/include/access/hash.h|  2 +
 src/include/access/hash_xlog.h   |  1 +
 src/include/access/heapam.h  |  2 +-
 src/include/access/nbtree.h  |  2 +
 src/include/access/nbtxlog.h |  1 +
 src/include/access/relscan.h | 15 -
 src/include/access/rmgr.h|  2 +-
 src/include/access/rmgrlist.h| 46 +++---
 src/include/access/tableam.h | 14 ++--
 src/include/access/xlog_internal.h   |  4 ++
 35 files changed, 422 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/common/bufmask.c b/src/backend/access/common/bufmask.c
index 4e953bfd61..22026482ad 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -128,3 +128,28 @@ mask_page_content(Page page)
 	memset(&((PageHeader) page)->pd_upper, MASK_MARKER,
 		   sizeof(uint16));
 }
+
+/*
+ * mask_lp_dead
+ *
+ * In some index AMs, line pointer flags can be modified without emitting any
+ * WAL record. Sometimes it is required to mask LP_DEAD flags set on primary to
+ * set own values on standby.
+ */
+void
+mask_lp_dead(Page page)
+{
+	OffsetNumber offnum,
+ maxoff;
+
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemId = PageGetItemId(page, offnum);
+
+		if (ItemIdHasStorage(itemId) && ItemIdIsDead(itemId))
+			itemId->lp_flags = LP_NORMAL;
+	}
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index adbf622c83..1905c04c51 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/bufmask.h"
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
@@ -49,6 +50,7 @@ gistkillitems(IndexScanDesc scan)
 	Assert(so->curBlkno != InvalidBlockNumber);
 	Assert(!XLogRecPtrIsInvalid(so->curPageLSN));
 	Assert(so->killedItems != NULL);
+	Assert(so->numKilled > 0);
 
 	buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
 	if (!BufferIsValid(buffer))
@@ -62,8 +64,13 @@ gistkillitems(IndexScanDesc scan)
 	 * If page LSN differs it means that the page was modified since the last
 	 * read. killedItems could be not valid so LP_DEAD hints applying is not
 	 * safe.
+	 *
+	 * Another case - standby was promoted after start of current transaction.
+	 * It is not required for correctness, but it is better to just skip
+	 * everything.
 	 */
-	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
+	if ((BufferGetLSNAtomic(buffer) != so->curPageLSN) ||
+			(scan->xactStartedInRecovery && !RecoveryInProgress()))
 	{
 		UnlockReleaseBuffer(buffer);
 		so->numKilled = 0;		/* reset counter */
@@ -71,6 +78,20 @@ gistkillitems(IndexScanDesc scan)
 	}
 
 	Assert(GistPageIsLeaf(page));
+	if (GistPageHasLpSafeOnStandby(page) && !scan->xactStartedInRecovery)
+	{
+		/* Seems like server was promoted some time ago,
+		 

Re: MultiXact/SLRU buffers configuration

2022-01-15 Thread Justin Pryzby
On Sat, Jan 15, 2022 at 12:16:59PM +0500, Andrey Borodin wrote:
> > On 15 Jan 2022, at 03:20, Shawn Debnath wrote:
> > On Fri, Jan 14, 2022 at 05:28:38PM +0800, Julien Rouhaud wrote:
> >>> PFA rebase of the patchset. Also I've added a patch to combine 
> >>> page_number, page_status, and page_dirty together to touch less 
> >>> cachelines.
> >> 
> >> The cfbot reports some errors on the latest version of the patch:
> >> 
> >> https://cirrus-ci.com/task/6121317215764480
> >> [...]
> >> Could you send a new version?  In the meantime I will switch the patch 
> >> status to Waiting on Author.
> > 
> > I was planning on running a set of stress tests on these patches. Could 
> > we confirm which ones we plan to include in the commitfest?
> 
> Many thanks for your interest. Here's the  latest version.

This is failing to compile under linux and windows due to bitfield syntax.
http://cfbot.cputube.org/andrey-borodin.html

And compile warnings:

slru.c: In function ‘SlruAdjustNSlots’:
slru.c:161:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
  161 |  int nbanks = 1;
  |  ^~~
slru.c: In function ‘SimpleLruReadPage_ReadOnly’:
slru.c:530:2: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
  530 |  int bankstart = (pageno & shared->bank_mask) * shared->bank_size;
  |  ^~~
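
For anyone unfamiliar with that warning: it fires when a declaration
appears after an executable statement inside a block, which the
project's C90-style rules do not allow.  A hypothetical illustration
(not the actual slru.c code):

/* Hypothetical example only, not slru.c. */
static int
adjust_nslots(int nslots)
{
	int			nbanks;			/* declared before any statement: OK */

	nslots = nslots / 2;
	nbanks = 1;					/* plain assignment after statements: OK */

	/*
	 * Writing "int nbanks = 1;" down here, after the statement above, is
	 * exactly what -Wdeclaration-after-statement complains about.
	 */
	return nslots * nbanks;
}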

Note that you can test the CI result using any github account without waiting
for the cfbot.  See ./src/tools/ci/README.

-- 
Justin




Re: Refactoring of compression options in pg_basebackup

2022-01-15 Thread Magnus Hagander
On Fri, Jan 14, 2022 at 10:53 PM Robert Haas  wrote:
>
> On Thu, Jan 13, 2022 at 10:23 PM Michael Paquier  wrote:
> > Using --compression-level=NUMBER and --server-compress=METHOD to
> > specify a server-side compression method with a level is fine by me,
> > but I find the reuse of --compress to specify a compression method
> > confusing as it maps with the past option we have kept in
> > pg_basebackup for a couple of years now.  Based on your suggested set
> > of options, we could then have a --client-compress=METHOD and
> > --compression-level=NUMBER to specify a client-side compression method
> > with a level.  If we do that, I guess that we should then:
> > 1) Block the combination of --server-compress and --client-compress.
> > 2) Remove the existing -Z/--compress and -z/--gzip.
>
> I could live with that. I'm not sure that --client-compress instead of
> reusing --compress is going to be better ... but I don't think it's
> awful so much as just not my first choice. I also don't think it would
> be horrid to leave -z, --gzip, and -Z as shorthands for the
> --client-compress=gzip with --compression-level also in the last case,
> instead of removing all that stuff.

It never makes sense to compress *both* in server and client, right?

One argument in that case for using --compress would be that we could
have that one take options like --compress=gzip (use gzip in the
client) and --compress=server-lz4 (use lz4 on the server), and
automatically make it impossible to do both. And maybe also accept
--compress=client-gzip (which would be the same as just specifying
gzip).

That would be an argument for actually keeping --compress and not
using --client-compress, because obviously it would be silly to have
--client-compress=server-lz4...

And yes, I agree that considering both server and client compression
even if we don't have server compression makes sense, since we don't
want to change things around again when we get it.

We could perhaps also consider accepting --compress=gzip:7
(<method>:<level>) as a way to specify the level, for both the client and
server side.
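
Just to illustrate the shape of that idea (hypothetical parsing code,
nothing that exists in pg_basebackup today):

/* Hypothetical sketch of parsing a METHOD[:LEVEL] compression option. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
parse_compress_option(const char *arg, char *method, size_t methodlen,
					  int *level)
{
	const char *sep = strchr(arg, ':');

	if (sep == NULL)
	{
		snprintf(method, methodlen, "%s", arg);
		*level = -1;			/* use the method's default level */
	}
	else
	{
		snprintf(method, methodlen, "%.*s", (int) (sep - arg), arg);
		*level = atoi(sep + 1);
	}
}

int
main(void)
{
	char		method[32];
	int			level;

	parse_compress_option("server-lz4", method, sizeof(method), &level);
	printf("%s, level %d\n", method, level);
	parse_compress_option("gzip:7", method, sizeof(method), &level);
	printf("%s, level %d\n", method, level);
	return 0;
}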

I think having --client-compress and --server-compress separately but
having --compression-level *not* being separate would be confusing and
I *think* that's what the current patch proposes?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-15 Thread Bharath Rupireddy
On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
> >
> > We had an issue where there were many mapping files generated during
> > the crash recovery and end-of-recovery checkpoint was taking a lot of
> > time. We had to manually intervene and delete some of the mapping
> > files (although it may not sound sensible) to make end-of-recovery
> > checkpoint faster. Because of the race condition between manual
> > deletion and checkpoint deletion, the unlink error occurred which
> > crashed the server and the server entered the recovery again wasting
> > the entire earlier recovery work.
>
> Maybe I'm missing something but wouldn't
> https://commitfest.postgresql.org/36/3448/ better solve the problem?

The error can cause the new background process proposed in that
thread to restart, which is again costly. Since we have LOG-only and
continue behavior in CheckPointSnapBuild already, having the same
behavior for CheckPointLogicalRewriteHeap helps a lot.

Regards,
Bharath Rupireddy.




Re: Skipping logical replication transactions on subscriber side

2022-01-15 Thread Amit Kapila
On Fri, Jan 14, 2022 at 5:35 PM vignesh C  wrote:
>
> Thanks for the updated patch, few minor comments:
> 1) Should "SKIP" be "SKIP (" here:
> @@ -1675,7 +1675,7 @@ psql_completion(const char *text, int start, int end)
> /* ALTER SUBSCRIPTION  */
> else if (Matches("ALTER", "SUBSCRIPTION", MatchAny))
> COMPLETE_WITH("CONNECTION", "ENABLE", "DISABLE", "OWNER TO",
> - "RENAME TO", "REFRESH
> PUBLICATION", "SET",
> + "RENAME TO", "REFRESH
> PUBLICATION", "SET", "SKIP",
>

Won't the other rule added by the patch, shown below, be sufficient for
what you are asking?
+ /* ALTER SUBSCRIPTION  SKIP */
+ else if (Matches("ALTER", "SUBSCRIPTION", MatchAny, "SKIP"))
+ COMPLETE_WITH("(");

I might be missing something but why do you think the handling of SKIP
be any different than what we are doing for SET?

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-01-15 Thread Amit Kapila
On Fri, Jan 14, 2022 at 7:49 AM Masahiko Sawada  wrote:
>
> I agree with all the comments above. I've attached an updated patch.
>

Review comments

1.
+
+  
+   In this case, you need to consider changing the data on the
subscriber so that it

The start of this sentence doesn't make sense to me. How about
changing it to something like: "To resolve conflicts, you need to ...

2.
+  
pg_subscription.subskipxid)
+  is cleared.  See  for
+  the details of logical replication conflicts.
+ 
+
+ 
+  skip_option specifies options for
this operation.
+  The supported option is:
+
+  
+   
+xid (xid)
+
+ 
+  Specifies the ID of the transaction whose changes are to be skipped
+  by the logical replication worker. Setting
NONE resets
+  the transaction ID.
+ 

Trailing spaces at the ends of lines are inconsistent. I personally use a
single space before a new line, but I see that others use two spaces,
and the nearby documentation also uses two spaces in this regard, so I
am fine either way; let's just be consistent.

3.
+ case ALTER_SUBSCRIPTION_SKIP:
+ {
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser to skip transaction")));
+
+ parse_subscription_options(pstate, stmt->options, SUBOPT_XID, &opts);
+
+ if (IsSet(opts.specified_opts, SUBOPT_XID))
..
..

Is there a case when the above 'if (IsSet(..' won't be true? If not,
then probably there should be Assert instead of 'if'.

4.
+static TransactionId skipping_xid = InvalidTransactionId;

I find this variable name bit odd. Can we name it skip_xid?

5.
+ * skipping_xid is valid if we are skipping all data modification changes
+ * (INSERT, UPDATE, etc.) of the specified transaction at
MySubscription->skipxid.
+ * Once we start skipping changes, we don't stop it until we skip all changes

I think it would be better to write the first line of comment as: "We
enable skipping all data modification changes (INSERT, UPDATE, etc.)
for the subscription if the user has specified skip_xid. Once we ..."

6.
+static void
+maybe_start_skipping_changes(TransactionId xid)
+{
+ Assert(!is_skipping_changes());
+ Assert(!in_remote_transaction);
+ Assert(!in_streamed_transaction);
+
+ /* Make sure subscription cache is up-to-date */
+ maybe_reread_subscription();

Why do we need to update the cache here by calling
maybe_reread_subscription() and at other places in the patch? It is
sufficient to get the skip_xid value at the start of the worker via
GetSubscription().

7. In maybe_reread_subscription(), isn't there a need to check whether
skip_xid is changed where we exit and launch the worker and compare
other subscription parameters?

8.
+static void
+clear_subscription_skip_xid(TransactionId xid, XLogRecPtr origin_lsn,
+ TimestampTz origin_timestamp)
+{
+ Relation rel;
+ Form_pg_subscription subform;
+ HeapTuple tup;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, false, sizeof(nulls));
+ memset(replaces, false, sizeof(replaces));
+
+ if (!IsTransactionState())
+ StartTransactionCommand();
+
+ LockSharedObject(SubscriptionRelationId, MySubscription->oid, 0,
+ AccessShareLock);

It is important to add a comment as to why we need a lock here.

9.
+ * needs to be set subskipxid again.  We can reduce the possibility by
+ * logging a replication origin WAL record to advance the origin LSN
+ * instead but it doesn't seem to be worth since it's a very minor case.

You can also add here that there is no way to advance origin_timestamp
so that would be inconsistent.

10.
+clear_subscription_skip_xid(TransactionId xid, XLogRecPtr origin_lsn,
+ TimestampTz origin_timestamp)
{
..
..
+ if (!IsTransactionState())
+ StartTransactionCommand();
..
..
+ CommitTransactionCommand();
..
}

The transaction should be committed in this function if it is started
here otherwise it should be the responsibility of the caller to commit
it.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Allow multiple recursive self-references

2022-01-15 Thread Denis Hirn
> On 14. Jan 2022, at 13:21, Peter Eisentraut wrote:

> There is nothing in there that says that certain branches of the UNION
> in a recursive query mean certain things. In fact, it doesn't even
> require the query to contain a UNION at all.  It just says to iterate on
> evaluating the query until a fixed point is reached.  I think this
> supports my claim that the associativity and commutativity of a UNION in
> a recursive query still apply.
>
> This is all very complicated, so I don't claim this to be authoritative,
> but I just don't see anything in the spec that supports what you are
> saying.

Please also have a look at SQL:2016, 7.16 <query expression>, General
Rules 2) c), which defines the evaluation semantics of recursive queries.
I think that this part of the SQL standard refutes your argument.

Best,
  -- Denis




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-15 Thread Julien Rouhaud
Hi,

On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
> 
> We had an issue where there were many mapping files generated during
> the crash recovery and end-of-recovery checkpoint was taking a lot of
> time. We had to manually intervene and delete some of the mapping
> files (although it may not sound sensible) to make end-of-recovery
> checkpoint faster. Because of the race condition between manual
> deletion and checkpoint deletion, the unlink error occurred which
> crashed the server and the server entered the recovery again wasting
> the entire earlier recovery work.

Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?




Re: psql - add SHOW_ALL_RESULTS option

2022-01-15 Thread Fabien COELHO


Hello Andres,


The reason this test constantly fails on cfbot windows is a use-after-free
bug.


Indeed! Thanks a lot for the catch and the debug!

The ClearOrSaveResult function is quite annoying because it may or may not 
clear the result as a side effect.


Attached v14 moves the status extraction before the possible clear. I've 
added a couple of results = NULL after such calls in the code.


--
Fabien.

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index e0abe34bb6..8f7f93172a 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -50,8 +50,28 @@ BEGIN \;
 SELECT 2.0 AS "float" \;
 SELECT 'world' AS "text" \;
 COMMIT;
+ float 
+---
+   2.0
+(1 row)
+
+ text  
+---
+ world
+(1 row)
+
 -- compound with empty statements and spurious leading spacing
 \;\;   SELECT 3 + 3 \;\;\;   SELECT ' ' || ' !' \;\;   SELECT 1 + 4 \;;
+ ?column? 
+--
+6
+(1 row)
+
+ ?column? 
+--
+   !
+(1 row)
+
  ?column? 
 --
 5
@@ -61,6 +81,11 @@ COMMIT;
 SELECT 1 + 1 + 1 AS "add" \gset
 SELECT :add + 1 + 1 AS "add" \;
 SELECT :add + 1 + 1 AS "add" \gset
+ add 
+-
+   5
+(1 row)
+
 -- set operator
 SELECT 1 AS i UNION SELECT 2 ORDER BY i;
  i 
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 1ab200a4ad..0a22850912 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -127,18 +127,11 @@ echo '\x \\ SELECT * FROM foo;' | psql
commands included in the string to divide it into multiple
transactions.  (See 
for more details about how the server handles multi-query strings.)
-   Also, psql only prints the
-   result of the last SQL command in the string.
-   This is different from the behavior when the same string is read from
-   a file or fed to psql's standard input,
-   because then psql sends
-   each SQL command separately.
   
   
-   Because of this behavior, putting more than one SQL command in a
-   single -c string often has unexpected results.
-   It's better to use repeated -c commands or feed
-   multiple commands to psql's standard input,
+   If having several commands executed in one transaction is not desired, 
+   use repeated -c commands or feed multiple commands to
+   psql's standard input,
either using echo as illustrated above, or
via a shell here-document, for example:
 
@@ -3570,10 +3563,6 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See 
 for more details about how the server handles multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
 
 
   
@@ -4160,6 +4149,18 @@ bar
   
 
   
+SHOW_ALL_RESULTS
+
+
+When this variable is set to off, only the last
+result of a combined query (\;) is shown instead of
+all of them.  The default is on.  The off behavior
+is for compatibility with older versions of psql.
+
+
+  
+
+   
 SHOW_CONTEXT
 
 
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 3503605a7d..47eabcbb8e 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -34,6 +34,8 @@ static bool DescribeQuery(const char *query, double *elapsed_msec);
 static bool ExecQueryUsingCursor(const char *query, double *elapsed_msec);
 static bool command_no_begin(const char *query);
 static bool is_select_command(const char *query);
+static int SendQueryAndProcessResults(const char *query, double *pelapsed_msec,
+	bool is_watch, const printQueryOpt *opt, FILE *printQueryFout, bool *tx_ended);
 
 
 /*
@@ -354,7 +356,7 @@ CheckConnection(void)
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -385,7 +387,7 @@ AcceptResult(const PGresult *result)
 break;
 		}
 
-	if (!OK)
+	if (!OK && show_error)
 	{
 		const char *error = PQerrorMessage(pset.db);
 
@@ -473,6 +475,18 @@ ClearOrSaveResult(PGresult *result)
 	}
 }
 
+/*
+ * Consume all results
+ */
+static void
+ClearOrSaveAllResults()
+{
+	PGresult	*result;
+
+	while ((result = PQgetResult(pset.db)) != NULL)
+		ClearOrSaveResult(result);
+}
+
 
 /*
  * Print microtiming output.  Always print raw milliseconds; if the interval
@@ -573,7 +587,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -596,11 +610,8 @@ int
 PSQLexecWatch(const char *query, 

Re: pg_replslotdata - a tool for displaying replication slot information

2022-01-15 Thread Julien Rouhaud
Hi,

On Mon, Dec 06, 2021 at 07:16:12PM +, Bossart, Nathan wrote:
> On 12/5/21, 11:10 PM, "Michael Paquier"  wrote:
> > On Thu, Dec 02, 2021 at 08:32:08AM +0530, Bharath Rupireddy wrote:
> >> On Thu, Dec 2, 2021 at 4:22 AM Andres Freund  wrote:
>  I don't have any other compelling use-cases at the moment, but I will say
>  that it is typically nice from an administrative standpoint to be able to
>  inspect things like this without logging into a running server.
> >>>
> >>> Weighed against the cost of maintaining (including documentation) a
> >>> separate tool, this doesn't seem sufficient reason.
> >> 
> >> IMHO, this shouldn't be a reason to not get something useful (useful
> >> IMO and few others in this thread) into the core. The maintenance of
> >> the tools generally is low compared to the core server features once
> >> they get reviewed and committed.
> >
> > Well, a bit less maintenance is always better than more maintenance.
> > An extra cost that you may be missing is related to the translation of
> > the documentation, as well as the translation of any new strings this
> > would require.  FWIW, I don't directly see a use for this tool that
> > could not be solved with an online server.
> 
> Bharath, perhaps you should maintain this outside of core PostgreSQL
> for now.  If some compelling use-cases ever surface that make it seem
> worth the added maintenance burden, this thread could probably be
> revisited.

Ironically, the patch is currently failing due to a translation problem:

https://cirrus-ci.com/task/5467034313031680
[19:12:28.179] su postgres -c "make -s -j${BUILD_JOBS} world-bin"
[19:12:44.270] make[3]: *** No rule to make target 'po/cs.po', needed by 
'po/cs.mo'.  Stop.
[19:12:44.270] make[2]: *** [Makefile:44: all-pg_replslotdata-recurse] Error 2
[19:12:44.270] make[2]: *** Waiting for unfinished jobs
[19:12:44.499] make[1]: *** [Makefile:42: all-bin-recurse] Error 2
[19:12:44.499] make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2

Looking at the thread, I see support from 3 people:

- Bharath
- Japin
- Satyanarayana

while 3 committers think that the extra maintenance effort isn't worth the
expected usage:

- Peter E.
- Andres
- Michael

and a +0.5 from Nathan IIUC.

I also personally don't think that this is worth the maintenance effort.  This
tool being entirely client-side, there's no problem with maintaining it in a
separate repository, as mentioned by Nathan, including using it on cloud
providers that provide access to at least the data files.  Another pro of the
external repo is that the tool can be made available immediately, and for older
releases.

Since 3 committers voted against it I think that the patch should be closed
as "Rejected".  I will do that in a few days unless there's some compelling
objection by then.




Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-01-15 Thread Bharath Rupireddy
On Fri, Jan 14, 2022 at 1:08 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing.
>
> This seems like it'd make failures to remove the files practically
> invisible. Which'd have it's own set of problems?
>
> What motivated proposing this change?

We had an issue where many mapping files were generated during crash
recovery, and the end-of-recovery checkpoint was taking a lot of time.
We had to intervene manually and delete some of the mapping files
(although that may not sound sensible) to make the end-of-recovery
checkpoint faster. Because of the race condition between the manual
deletion and the checkpoint's own deletion, an unlink error occurred,
which crashed the server; the server then entered recovery again,
wasting the entire earlier recovery work.

In summary, with the changes proposed for CheckPointLogicalRewriteHeap
in this thread (emitting LOG-only messages for unlink failures and
continuing with the other files), matching the existing behaviour of
CheckPointSnapBuild, I'm confident this will help avoid wasting
recovery work that has already been done if unlink fails for any
reason.
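
As a rough sketch of the proposed behaviour (not the actual patch; "path"
stands for the mapping file currently being processed inside
CheckPointLogicalRewriteHeap()'s loop over pg_logical/mappings):

if (unlink(path) < 0 && errno != ENOENT)
{
	ereport(LOG,				/* previously ERROR, aborting the checkpoint */
			(errcode_for_file_access(),
			 errmsg("could not remove file \"%s\": %m", path)));
	continue;					/* keep going with the remaining files */
}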

Regards,
Bharath Rupireddy.




Re: missing indexes in indexlist with partitioned tables

2022-01-15 Thread Julien Rouhaud
Hi,

On Thu, Oct 28, 2021 at 01:44:31PM +, Arne Roland wrote:
> 
> The attached patch takes that route. I'd appreciate feedback!

The cfbot reports that the patch doesn't apply anymore:

http://cfbot.cputube.org/patch_36_3452.log
=== Applying patches on top of PostgreSQL commit ID 
025b920a3d45fed441a0a58fdcdf05b321b1eead ===
=== applying patch ./partIndexlistClean.patch
patching file src/backend/access/heap/vacuumlazy.c
Hunk #1 FAILED at 2375.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/access/heap/vacuumlazy.c.rej
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 911 with fuzz 1 (offset 5 lines).
Hunk #2 FAILED at 5753.
[...]
1 out of 6 hunks FAILED -- saving rejects to file 
src/backend/access/transam/xlog.c.rej
[...]
patching file src/backend/commands/publicationcmds.c
Hunk #1 FAILED at 813.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/commands/publicationcmds.c.rej
patching file src/include/nodes/pathnodes.h
Hunk #9 FAILED at 1516.
[...]
1 out of 17 hunks FAILED -- saving rejects to file 
src/include/nodes/pathnodes.h.rej

Could you send a rebased version?  In the meantime I will switch the cf entry
to Waiting on Author.




Re: pg14 psql broke \d datname.nspname.relname

2022-01-15 Thread Julien Rouhaud
Hi,

On Tue, Dec 21, 2021 at 10:58:39AM -0800, Mark Dilger wrote:
> 
> Rebased patch attached:

This version doesn't apply anymore:
http://cfbot.cputube.org/patch_36_3367.log
=== Applying patches on top of PostgreSQL commit ID 
5513dc6a304d8bda114004a3b906cc6fde5d6274 ===
=== applying patch 
./v3-0001-Reject-patterns-with-too-many-parts-or-wrong-db.patch
[...]
1 out of 52 hunks FAILED -- saving rejects to file src/bin/psql/describe.c.rej

Could you send a rebased version?  In the meantime I will switch the cf entry
to Waiting on Author.




Re: Pre-allocating WAL files

2022-01-15 Thread Bharath Rupireddy
On Sat, Jan 15, 2022 at 1:36 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 6, 2022 at 3:39 AM Bossart, Nathan  wrote:
> >
> > On 12/30/21, 3:52 AM, "Maxim Orlov"  wrote:
> > > I did check the patch too and found it to be ok. Check and check-world 
> > > are passed.
> > > Overall idea seems to be good in my opinion, but I'm not sure where is 
> > > the optimal place to put the pre-allocation.
> > >
> > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov  
> > > wrote:
> > >> I've checked the patch v7. It applies cleanly, code is good, check-world 
> > >> tests passed without problems.
> > >> I think it's ok to use checkpointer for this and the overall patch can 
> > >> be committed. But the seen performance gain makes me think again before 
> > >> adding this feature. I did tests myself a couple of months ago and got 
> > >> similar results.
> > >> Really don't know whether is it worth the effort.
> >
> > Thank you both for your review.
>
> It may have been discussed earlier, let me ask this here - IIUC the
> whole point of pre-allocating WAL files is that creating new WAL files
> of wal_segment_size requires us to write zero-filled empty pages to
> the disk which is costly. With the advent of
> fallocate/posix_fallocate, isn't file allocation going to be much
> faster on platforms where fallocate is supported? IIRC, the
> "Asynchronous and "direct" IO support for PostgreSQL." has a way to
> use fallocate. If at all, we move ahead and use fallocate, then the
> whole point of pre-allocating WAL files becomes unnecessary?
>
> Having said above, the idea of pre-allocating WAL files is still
> relevant, given the portability of fallocate/posix_fallocate.

Adding one more point: do we have any numbers on how much total time
WAL file allocation usually takes, say on a server under a heavy write
load?

Regards,
Bharath Rupireddy.




Re: Pre-allocating WAL files

2022-01-15 Thread Bharath Rupireddy
On Thu, Jan 6, 2022 at 3:39 AM Bossart, Nathan  wrote:
>
> On 12/30/21, 3:52 AM, "Maxim Orlov"  wrote:
> > I did check the patch too and found it to be ok. Check and check-world are 
> > passed.
> > Overall idea seems to be good in my opinion, but I'm not sure where is the 
> > optimal place to put the pre-allocation.
> >
> > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov  
> > wrote:
> >> I've checked the patch v7. It applies cleanly, code is good, check-world 
> >> tests passed without problems.
> >> I think it's ok to use checkpointer for this and the overall patch can be 
> >> committed. But the seen performance gain makes me think again before 
> >> adding this feature. I did tests myself a couple of months ago and got 
> >> similar results.
> >> Really don't know whether is it worth the effort.
>
> Thank you both for your review.

It may have been discussed earlier, but let me ask here - IIUC the
whole point of pre-allocating WAL files is that creating new WAL files
of wal_segment_size requires us to write zero-filled empty pages to
disk, which is costly. With the advent of fallocate/posix_fallocate,
isn't file allocation going to be much faster on platforms where
fallocate is supported? IIRC, the "Asynchronous and "direct" IO
support for PostgreSQL." thread has a way to use fallocate. If we do
move ahead and use fallocate, does the whole point of pre-allocating
WAL files become unnecessary?

Having said that, the idea of pre-allocating WAL files is still
relevant, given that fallocate/posix_fallocate is not available on all
platforms.
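
Purely to illustrate the point about fallocate (the helper name, segment
size macro and HAVE_POSIX_FALLOCATE guard are assumptions for this
sketch, not taken from any posted patch):

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)	/* one wal_segment_size worth */

static int
presize_wal_file(int fd)
{
#ifdef HAVE_POSIX_FALLOCATE
	/* metadata-level allocation: no 16 MB of zeroes pushed through write() */
	return posix_fallocate(fd, 0, SEG_SIZE);
#else
	/* fallback: write zero-filled blocks, roughly what happens today */
	char		zbuf[8192] = {0};
	off_t		off;

	for (off = 0; off < SEG_SIZE; off += sizeof(zbuf))
	{
		if (write(fd, zbuf, sizeof(zbuf)) != (ssize_t) sizeof(zbuf))
			return -1;			/* caller can inspect errno */
	}
	return 0;
#endif
}

Whether the fallocate path actually wins will of course depend on the
filesystem and platform.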

Regards,
Bharath Rupireddy.