Re: Improving connection scalability: GetSnapshotData()

2021-03-01 Thread Konstantin Knizhnik




On 27.02.2021 20:40, AJG wrote:

Hi,

Greatly appreciate if you could please reply to the following questions as
time allows.

I have seen previous discussion/patches on a built-in connection pooler. How
does this scalability improvement, particularly idle connection improvements
etc, affect that built-in pooler need, if any?


Same general question about an external connection pooler in general in
production? Still required to route to different servers but no longer
needed for the pooling part, as an example.


Many Thanks!



A connection pooler is still needed.
The patch for GetSnapshotData() mostly improves scalability of read-only
queries and reduces contention on the procarray lock.
But a read-write workload causes contention on many other resources:
relation extension lock, buffer locks, tuple locks and so on.


If you run pgbench on a NUMA machine with hundreds of cores, then you will
still observe significant degradation of performance with an increasing
number of connections.
And this degradation will be dramatic if you use a non-uniform
distribution of keys, for example a Zipfian distribution.
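
For illustration, such a skewed key distribution can be generated with
pgbench's random_zipfian() in a custom script, roughly like this (just a
sketch; the file name, scale and skew parameter here are arbitrary):

cat > zipf.sql <<'EOF'
\set aid random_zipfian(1, 100000 * :scale, 1.5)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
EOF
pgbench -n -M prepared -f zipf.sql -D scale=200 -c 100 -j 100 -T 60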









--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html








Re: Improving connection scalability: GetSnapshotData()

2021-03-01 Thread luis . roberto


- Mensagem original -
> De: "AJG" 
> Para: "Pg Hackers" 
> Enviadas: Sábado, 27 de fevereiro de 2021 14:40:58
> Assunto: Re: Improving connection scalability: GetSnapshotData()

> Hi,

> Greatly appreciate if you could please reply to the following questions as
> time allows.

> I have seen previous discussion/patches on a built-in connection pooler. How
> does this scalability improvement, particularly idle connection improvements
> etc, affect that built-in pooler need, if any?

> Same general question about an external connection pooler in general in
> production? Still required to route to different servers but no longer
> needed for the pooling part, as an example.

> Many Thanks!

> --
> Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


As I understand it, the improvements made to GetSnapshotData() mean that a
higher connection count does not incur as much of a performance penalty as
before.
I am not sure it solves the connection establishment side of things, but I may
be wrong.

Luis R. Weck 




Re: Improving connection scalability: GetSnapshotData()

2021-03-01 Thread AJG
Hi,

Greatly appreciate if you could please reply to the following questions as
time allows.

I have seen previous discussion/patches on a built-in connection pooler. How
does this scalability improvement, particularly idle connection improvements
etc, affect that built-in pooler need, if any?


Same general question about an external connection pooler in general in
production? Still required to route to different servers but no longer
needed for the pooling part, as an example.


Many Thanks!





--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: Improving connection scalability: GetSnapshotData()

2020-10-09 Thread Andrew Dunstan


On 10/5/20 10:33 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:
>> On 10/1/20 4:22 PM, Andres Freund wrote:
>>> # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
>>> # if we're using native Perl, but not if we're using MSys Perl.  So do it
>>> # by hand in the latter case, here and elsewhere.
>>> that IPC::Run converts things, but that native windows perl uses
>>> https://perldoc.perl.org/perlrun#PERLIO
>>> a PERLIO that includes :crlf, whereas msys probably doesn't?
>>>
>>> Any chance you could run something like
>>> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
>>> on both native and msys perl?
>>>
>>>
 possibly also for stderr, just to make it more futureproof, and at the
 top of the file:

 use Config;


 Do you want me to test that first?
>>> That'd be awesome.
>> The change I suggested makes jacana happy.
> Thanks, pushed. Hopefully that fixes the mingw animals.
>

I don't think we're out of the woods yet. This test is also having bad
effects on bowerbird, which is an MSVC animal. It's hanging completely :-(


Digging some more.


cheers


andrew


-- 
Andrew Dunstan
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-10-05 Thread Andres Freund
Hi,

On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:
> On 10/1/20 4:22 PM, Andres Freund wrote:
> > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
> > # if we're using native Perl, but not if we're using MSys Perl.  So do it
> > # by hand in the latter case, here and elsewhere.
> > that IPC::Run converts things, but that native windows perl uses
> > https://perldoc.perl.org/perlrun#PERLIO
> > a PERLIO that includes :crlf, whereas msys probably doesn't?
> >
> > Any chance you could run something like
> > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> > on both native and msys perl?
> >
> >
> >> possibly also for stderr, just to make it more futureproof, and at the
> >> top of the file:
> >>
> >> use Config;
> >>
> >>
> >> Do you want me to test that first?
> > That'd be awesome.

> The change I suggested makes jacana happy.

Thanks, pushed. Hopefully that fixes the mingw animals.

I wonder if we instead should do something like

# Have mingw perl treat CRLF the same way as perl on native windows does
ifeq ($(build_os),mingw32)
PROVE="PERLIO=unixcrlf $(PROVE)"
endif

in Makefile.global.in? Then we could remove these regexes from all the
various places?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Ian Barwick

On 2020/10/02 3:26, Andres Freund wrote:

Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current when
psql times out).  I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.


I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.


Thanks! Apologies for not getting back to your earlier responses,
have been distracted by Various Other Things.

The tests I run which originally triggered the issue now run just fine.


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 4:22 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
>> My strong suspicion is that we're getting unwanted CRs. Note the
>> presence of numerous instances of this in PostgresNode.pm:
>
>> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
>>
>> So you probably want something along those lines at the top of the loop
>> in send_query_and_wait:
>>
>> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> Yikes, that's ugly :(.
>
>
> I assume it's not, as the comment says
>   # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
>   # if we're using native Perl, but not if we're using MSys Perl.  So do it
>   # by hand in the latter case, here and elsewhere.
> that IPC::Run converts things, but that native windows perl uses
> https://perldoc.perl.org/perlrun#PERLIO
> a PERLIO that includes :crlf, whereas msys probably doesn't?
>
> Any chance you could run something like
> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> on both native and msys perl?
>
>
>> possibly also for stderr, just to make it more futureproof, and at the
>> top of the file:
>>
>> use Config;
>>
>>
>> Do you want me to test that first?
> That'd be awesome.
>
>
>

The change I suggested makes jacana happy.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi,

On 2020-10-01 16:44:03 -0400, Andrew Dunstan wrote:
> > I assume it's not, as the comment says
> > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
> > # if we're using native Perl, but not if we're using MSys Perl.  So do it
> > # by hand in the latter case, here and elsewhere.
> > that IPC::Run converts things, but that native windows perl uses
> > https://perldoc.perl.org/perlrun#PERLIO
> > a PERLIO that includes :crlf, whereas msys probably doesn't?
> >
> > Any chance you could run something like
> > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> > on both native and msys perl?
> 
> msys (jacana): stdio
> 
> native: unixcrlf

Interesting. That suggests we could get around needing the if msys
branches in several places by setting PERLIO to unixcrlf somewhere
centrally when using msys.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 4:22 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
>> My strong suspicion is that we're getting unwanted CRs. Note the
>> presence of numerous instances of this in PostgresNode.pm:
>
>> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
>>
>> So you probably want something along those lines at the top of the loop
>> in send_query_and_wait:
>>
>> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> Yikes, that's ugly :(.
>
>
> I assume it's not, as the comment says
>   # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
>   # if we're using native Perl, but not if we're using MSys Perl.  So do it
>   # by hand in the latter case, here and elsewhere.
> that IPC::Run converts things, but that native windows perl uses
> https://perldoc.perl.org/perlrun#PERLIO
> a PERLIO that includes :crlf, whereas msys probably doesn't?
>
> Any chance you could run something like
> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
> on both native and msys perl?
>
>

msys (jacana): stdio

native: unixcrlf


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi,

On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote:
> My strong suspicion is that we're getting unwanted CRs. Note the
> presence of numerous instances of this in PostgresNode.pm:


> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
> 
> So you probably want something along those lines at the top of the loop
> in send_query_and_wait:
> 
> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

Yikes, that's ugly :(.


I assume it's not, as the comment says
# Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
# if we're using native Perl, but not if we're using MSys Perl.  So do it
# by hand in the latter case, here and elsewhere.
that IPC::Run converts things, but that native windows perl uses
https://perldoc.perl.org/perlrun#PERLIO
a PERLIO that includes :crlf, whereas msys probably doesn't?

Any chance you could run something like
perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
on both native and msys perl?


> possibly also for stderr, just to make it more futureproof, and at the
> top of the file:
> 
> use Config;
> 
> 

> Do you want me to test that first?

That'd be awesome.


> The difference in the canonical way perl states the regex is due to
> perl version differences. It shouldn't matter.

Thanks!

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andrew Dunstan


On 10/1/20 2:26 PM, Andres Freund wrote:
> Hi Ian, Andrew, All,
>
> On 2020-09-30 15:43:17 -0700, Andres Freund wrote:
>> Attached is an updated version of the test (better utility function,
>> stricter regexes, bailing out instead of failing just the current when
>> psql times out).  I'm leaving it in this test for now, but it's fairly
>> easy to use this way, in my opinion, so it may be worth moving to
>> PostgresNode at some point.
> I pushed this yesterday. Ian, thanks again for finding this and helping
> with fixing & testing.
>
>
> Unfortunately currently some buildfarm animals don't like the test for
> reasons I don't quite understand. Looks like it's all windows + msys
> animals that run the tap tests.  On jacana and fairywren the new test
> fails, but with a somewhat confusing problem:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34
> Bail out!  aborting wait: program timed out
> # stream contents: >>data
> # (0 rows)
> # <<
> # pattern searched for: (?m-xis:^\\(0 rows\\)$)
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13
> Bail out!  aborting wait: program timed out
> stream contents: >>data
> (0 rows)
> <<
> pattern searched for: (?^m:^\\(0 rows\\)$)
>
> I don't know what the -xis indicates on jacana, and why it's not present
> on fairywren. Nor do I know why the pattern doesn't match in the first
> place, sure looks like it should?
>
> Andrew, do you have an insight into how mingw's regex match differs
> from native windows and proper unixoid systems? I guess it's somewhere
> around line endings or such?
>
> Jacana successfully deals with 013_crash_restart.pl, which does use the
> same mechanism as the new 021_row_visibility.pl - I think the only real
> difference is that I used ^ and $ in the regexes in the latter...


My strong suspicion is that we're getting unwanted CRs. Note the
presence of numerous instances of this in PostgresNode.pm:

$stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

So you probably want something along those lines at the top of the loop
in send_query_and_wait:

$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';

possibly also for stderr, just to make it more futureproof, and at the
top of the file:

use Config;


Do you want me to test that first?


The difference in the canonical way perl states the regex is due to
perl version differences. It shouldn't matter.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Andres Freund
Hi Ian, Andrew, All,

On 2020-09-30 15:43:17 -0700, Andres Freund wrote:
> Attached is an updated version of the test (better utility function,
> stricter regexes, bailing out instead of failing just the current when
> psql times out).  I'm leaving it in this test for now, but it's fairly
> easy to use this way, in my opinion, so it may be worth moving to
> PostgresNode at some point.

I pushed this yesterday. Ian, thanks again for finding this and helping
with fixing & testing.


Unfortunately currently some buildfarm animals don't like the test for
reasons I don't quite understand. Looks like it's all windows + msys
animals that run the tap tests.  On jacana and fairywren the new test
fails, but with a somewhat confusing problem:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34
Bail out!  aborting wait: program timed out
# stream contents: >>data
# (0 rows)
# <<
# pattern searched for: (?m-xis:^\\(0 rows\\)$)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13
Bail out!  aborting wait: program timed out
stream contents: >>data
(0 rows)
<<
pattern searched for: (?^m:^\\(0 rows\\)$)

I don't know what the -xis indicates on jacana, and why it's not present
on fairywren. Nor do I know why the pattern doesn't match in the first
place, sure looks like it should?

Andrew, do you have an insight into how mingw's regex match differs
from native windows and proper unixoid systems? I guess it's somewhere
around line endings or such?

Jacana successfully deals with 013_crash_restart.pl, which does use the
same mechanism as the new 021_row_visibility.pl - I think the only real
difference is that I used ^ and $ in the regexes in the latter...

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-10-01 Thread Craig Ringer
On Tue, 15 Sep 2020 at 07:17, Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:
> > Attached, though bear in mind I'm not very familiar with parts of this,
> > particularly 2PC stuff, so consider it educated guesswork.
>
> Thanks for this, and the test case!
>
> Your commit fixes the issues, but not quite correctly. Multixacts
> shouldn't matter, so we don't need to do anything there. And for the
> increases, I think they should be inside the already existing
> ProcArrayLock acquisition, as in the attached.
>
>
> I've also included a quite heavily revised version of your test. Instead
> of using dblink I went for having a long-running psql that I feed over
> stdin. The main reason for not liking the previous version is that it
> seems fragile, with the sleep and everything.  I expanded it to cover
> 2PC as well.
>
> The test probably needs a bit of cleanup, wrapping some of the
> redundancy around the pump_until calls.
>
> I think the approach of having a long running psql session is really
> useful, and probably would speed up some tests. Does anybody have a good
> idea for how to best, and without undue effort, to integrate this into
> PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> it with a local helper in the new test?

2ndQ has some infra for that and various other TAP enhancements that
I'd like to try to upstream. I'll ask what I can share and how.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: Improving connection scalability: GetSnapshotData()

2020-09-30 Thread Andres Freund
Hi,

On 2020-09-14 16:17:18 -0700, Andres Freund wrote:
> I've also included a quite heavily revised version of your test. Instead
> of using dblink I went for having a long-running psql that I feed over
> stdin. The main reason for not liking the previous version is that it
> seems fragile, with the sleep and everything.  I expanded it to cover
> 2PC as well.
> 
> The test probably needs a bit of cleanup, wrapping some of the
> redundancy around the pump_until calls.
> 
> I think the approach of having a long running psql session is really
> useful, and probably would speed up some tests. Does anybody have a good
> idea for how to best, and without undue effort, to integrate this into
> PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> it with a local helper in the new test?

Attached is an updated version of the test (better utility function,
stricter regexes, bailing out instead of failing just the current when
psql times out).  I'm leaving it in this test for now, but it's fairly
easy to use this way, in my opinion, so it may be worth moving to
PostgresNode at some point.

Greetings,

Andres Freund
>From 2ca4fa9de369aeba0d6386ec7d749cb366259728 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v3] Fix and test snapshot behaviour on standby.

I (Andres) broke this in 623a9ba79bb, because I didn't think about the
way snapshots are built on standbys sufficiently. Unfortunately our
existing tests did not catch this, as they are all just querying with
psql (therefore ending up with fresh snapshots).

The fix is trivial, we just need to increment the completion counter
in ExpireTreeKnownAssignedTransactionIds(), which is the equivalent of
ProcArrayEndTransaction() during recovery.

This commit also adds a new test doing some basic testing of the
correctness of snapshots built on standbys. To avoid the
aforementioned issue of one-shot psql's not exercising the snapshot
caching, the test uses long-lived psqls, similar to
013_crash_restart.pl. It'd be good to extend the test further.

Reported-By: Ian Barwick 
Author: Andres Freund 
Author: Ian Barwick 
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb...@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c   |   3 +
 src/test/recovery/t/021_row_visibility.pl | 192 ++
 2 files changed, 195 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5aaeb6e2b55..07c5eeb7495 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 000..95516b05d01
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,192 @@
+# Checks that snapshots on standbys behave in a minimally reasonable
+# way.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(30);
+
+# One psql to primary and standby each, for all queries. That allows
+# to check uncommitted changes being replicated and such.
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} =
+  IPC::Run::start(
+	  ['psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres')],
+	  '<', \$psql_primary{stdin},
+	  '>', \$psql_primary{stdout},
+	  '2>', \$psql_primary{stderr},
+	  $psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} =
+  IPC::Run::start(
+	  ['psql', '-XA', '-f', '-', '-d', 

Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-15 11:56:24 +0900, Michael Paquier wrote:
> On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:
> > My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
> > isn't pretty. Just as 013_crash_restart.pl already did (even before
> > psql/t/010_tab_completion.pl). I am mostly wondering whether we could
> > avoid copying the utility functions into multiple test files...
> > 
> > Does IO::Pty work on windows? Given that currently the test doesn't use
> > a pty and that there's no benefit I can see in requiring one, I'm a bit
> > hesitant to go there?
> 
> Per https://metacpan.org/pod/IO::Tty:
> "Windows is now supported, but ONLY under the Cygwin environment, see
> http://sources.redhat.com/cygwin/."
> 
> So I would suggest to make stuff a soft dependency (as Tom is
> hinting?), and not worry about Windows specifically.  It is not like
> what we are dealing with here is specific to Windows anyway, so you
> would have already sufficient coverage.  I would not mind if any
> refactoring is done later, once we know that the proposed test is
> stable in the buildfarm as we would get a better image of what part of
> the facility overlaps across multiple tests.

I'm confused - the test as posted should work on windows, and we already
do this in an existing test (src/test/recovery/t/013_crash_restart.pl). What's
the point in adding a platform specific dependency here?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Michael Paquier
On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote:
> My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
> isn't pretty. Just as 013_crash_restart.pl already did (even before
> psql/t/010_tab_completion.pl). I am mostly wondering whether we could
> avoid copying the utility functions into multiple test files...
> 
> Does IO::Pty work on windows? Given that currently the test doesn't use
> a pty and that there's no benefit I can see in requiring one, I'm a bit
> hesitant to go there?

Per https://metacpan.org/pod/IO::Tty:
"Windows is now supported, but ONLY under the Cygwin environment, see
http://sources.redhat.com/cygwin/."

So I would suggest to make stuff a soft dependency (as Tom is
hinting?), and not worry about Windows specifically.  It is not like
what we are dealing with here is specific to Windows anyway, so you
would have already sufficient coverage.  I would not mind if any
refactoring is done later, once we know that the proposed test is
stable in the buildfarm as we would get a better image of what part of
the facility overlaps across multiple tests.
--
Michael




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-14 20:14:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I think the approach of having a long running psql session is really
> > useful, and probably would speed up some tests. Does anybody have a good
> > idea for how to best, and without undue effort, to integrate this into
> > PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> > it with a local helper in the new test?
> 
> You could use the interactive_psql infrastructure that already exists
> for psql/t/010_tab_completion.pl.  That does rely on IO::Pty, but
> I think I'd prefer to accept that dependency for such tests over rolling
> our own IPC::Run, which is more or less what you've done here.

My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess
isn't pretty. Just as 013_crash_restart.pl already did (even before
psql/t/010_tab_completion.pl). I am mostly wondering whether we could
avoid copying the utility functions into multiple test files...

Does IO::Pty work on windows? Given that currently the test doesn't use
a pty and that there's no benefit I can see in requiring one, I'm a bit
hesitant to go there?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Tom Lane
Andres Freund  writes:
> I think the approach of having a long running psql session is really
> useful, and probably would speed up some tests. Does anybody have a good
> idea for how to best, and without undue effort, to integrate this into
> PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
> it with a local helper in the new test?

You could use the interactive_psql infrastructure that already exists
for psql/t/010_tab_completion.pl.  That does rely on IO::Pty, but
I think I'd prefer to accept that dependency for such tests over rolling
our own IPC::Run, which is more or less what you've done here.

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-09-14 Thread Andres Freund
Hi,

On 2020-09-09 17:02:58 +0900, Ian Barwick wrote:
> Attached, though bear in mind I'm not very familiar with parts of this,
> particularly 2PC stuff, so consider it educated guesswork.

Thanks for this, and the test case!

Your commit fixes the issues, but not quite correctly. Multixacts
shouldn't matter, so we don't need to do anything there. And for the
increases, I think they should be inside the already existing
ProcArrayLock acquisition, as in the attached.


I've also included a quite heavily revised version of your test. Instead
of using dblink I went for having a long-running psql that I feed over
stdin. The main reason for not liking the previous version is that it
seems fragile, with the sleep and everything.  I expanded it to cover
2PC as well.

The test probably needs a bit of cleanup, wrapping some of the
redundancy around the pump_until calls.

I think the approach of having a long running psql session is really
useful, and probably would speed up some tests. Does anybody have a good
idea for how to best, and without undue effort, to integrate this into
PostgresNode.pm?  I don't really have a great idea, so I think I'd leave
it with a local helper in the new test?

Regards,

Andres
>From a637b65fc53b208857e0d3d17141d8ed3609036f Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v2] WIP: fix and test snapshot behaviour on standby.

Reported-By: Ian Barwick 
Author: Andres Freund 
Author: Ian Barwick 
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb...@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c   |   3 +
 src/test/recovery/t/021_row_visibility.pl | 227 ++
 2 files changed, 230 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 802b119c490..fffa5f7a93e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 000..08713fa2686
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,227 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(5);
+
+
+# One psql to primary for all queries. That allows to check
+# uncommitted changes being replicated and such.
+my ($psql_primary_stdin, $psql_primary_stdout, $psql_primary_stderr) = ('', '', '');
+my $psql_primary = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$psql_primary_stdin,
+	'>',
+	\$psql_primary_stdout,
+	'2>',
+	\$psql_primary_stderr,
+	$psql_timeout);
+
+# One psql to standby for all queries. That allows to reuse the same
+# session for multiple queries, which is important to detect some
+# types of errors.
+my ($psql_standby_stdin, $psql_standby_stdout, $psql_standby_stderr) = ('', '', '');
+my $psql_standby = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_standby->connstr('postgres')
+	],
+	'<',
+	\$psql_standby_stdin,
+	'>',
+	\$psql_standby_stdout,
+	'2>',
+	\$psql_standby_stderr,
+	$psql_timeout);
+
+#
+# 1. Check initial data is the same
+#
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+  ];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/0 rows/m),
+   'data not visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+
+#
+# 2. Check if an INSERT is replayed and visible
+#

Re: Improving connection scalability: GetSnapshotData()

2020-09-09 Thread Ian Barwick

On 2020/09/08 13:23, Ian Barwick wrote:

On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

 *** a/src/backend/access/transam/xact.c
 --- b/src/backend/access/transam/xact.c
 *** xact_redo_commit(xl_xact_parsed_commit *
 *** 5915,5920 
 --- 5915,5924 
  */
 if (XactCompletionApplyFeedback(parsed->xinfo))
 XLogRequestWalReceiverReply();
 +
 +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 +   ShmemVariableCache->xactCompletionCount++;
 +   LWLockRelease(ProcArrayLock);
   }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).


Yup.


At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?


Sure, I'll give it a go as I have some time right now.



Attached, though bear in mind I'm not very familiar with parts of this,
particularly 2PC stuff, so consider it educated guesswork.


Regards


Ian Barwick

--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
commit 544e2b1661413fe08e3083f03063c12c0d7cf3aa
Author: Ian Barwick 
Date:   Tue Sep 8 12:24:14 2020 +0900

Fix snapshot caching on standbys

Addresses issue introduced in 623a9ba.

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..227d03bbce 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3320,6 +3320,14 @@ multixact_redo(XLogReaderState *record)
 	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 Datum
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..04ca858918 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5915,6 +5915,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	 */
 	if (XactCompletionApplyFeedback(parsed->xinfo))
 		XLogRequestWalReceiverReply();
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -5978,6 +5986,14 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 void


Re: Improving connection scalability: GetSnapshotData()

2020-09-09 Thread Ian Barwick

On 2020/09/09 2:53, Andres Freund wrote:

Hi,

On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:

On Tue, Sep 8, 2020 at 4:11 PM Andres Freund  wrote:

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...


I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1].  Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...).


Unfortunately proper multi-node isolationtester test basically is
equivalent to building a global lock graph. I think, at least? Including
a need to be able to correlate connections with their locks between the
nodes.

But for something like the bug at hand it'd probably be sufficient to just
"hack" something with dblink. In session 1) insert a row on the primary
using dblink, return the LSN, wait for the LSN to have replicated and
finally in session 2) check for row visibility.


The attached seems to do the trick.


Regards


Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
commit b31d587d71f75115b02dd1bf6230a56722c67832
Author: Ian Barwick 
Date:   Wed Sep 9 14:37:40 2020 +0900

test for standby row visibility

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index fa8e031526..2d9a9701fc 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/dblink
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 00..5f591d131e
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,84 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+	'CREATE EXTENSION dblink');
+
+
+# Add an arbitrary table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+sleep(5);
+# Check row visibility in an existing standby session
+
+my ($res, $stdout, $stderr) = $node_standby->psql(
+'postgres',
+sprintf(
+<<'EO_SQL',
+DO $$
+  DECLARE
+primary_lsn pg_lsn;
+insert_xmin xid;
+standby_rec RECORD;
+  BEGIN
+SELECT INTO primary_lsn, insert_xmin
+   t1.primary_lsn, t1.xmin
+   FROM dblink(
+  'host=%s port=%i dbname=postgres',
+  'INSERT INTO public.foo VALUES (1) RETURNING pg_catalog.pg_current_wal_lsn(), xmin'
+   ) AS t1(primary_lsn pg_lsn, xmin xid);
+
+LOOP
+  EXIT WHEN pg_catalog.pg_last_wal_replay_lsn() > primary_lsn;
+END LOOP;
+
+SELECT INTO standby_rec
+   id
+  FROM public.foo
+ WHERE id = 1 AND xmin = insert_xmin;
+
+IF FOUND
+  THEN
+RAISE NOTICE 'row found';
+  ELSE
+RAISE NOTICE 'row not found';
+END IF;
+
+  END;
+$$;
+EO_SQL
+$node_primary->host,
+$node_primary->port,
+),
+);
+
+
+like (
+$stderr,
+qr/row found/,
+'check that inserted row is visible on the standby',
+);
+
+$node_primary->stop;
+$node_standby->stop;


Re: Improving connection scalability: GetSnapshotData()

2020-09-08 Thread Andres Freund
Hi,

On 2020-06-07 11:24:50 +0300, Michail Nikolaev wrote:
> Hello, hackers.
> Andres, nice work!
> 
> Sorry for going off-topic.
> 
> Some of my work [1] related to the support of index hint bits on
> standby is highly interfering with this patch.
> Is it safe to consider it committed and start rebasing on top of the patches?

Sorry, I missed this email. Since they're now committed, yes, it is safe
;)

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-08 Thread Andres Freund
Hi,

On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:
> On Tue, Sep 8, 2020 at 4:11 PM Andres Freund  wrote:
> > At first I was very confused as to why none of the existing tests have
> > found this significant issue. But after thinking about it for a minute
> > that's because they all use psql, and largely separate psql invocations
> > for each query :(. Which means that there's no cached snapshot around...
> 
> I prototyped a TAP test patch that could maybe do the sort of thing
> you need, in patch 0006 over at [1].  Later versions of that patch set
> dropped it, because I figured out how to use the isolation tester
> instead, but I guess you can't do that for a standby test (at least
> not until someone teaches the isolation tester to support multi-node
> schedules, something that would be extremely useful...).

Unfortunately proper multi-node isolationtester test basically is
equivalent to building a global lock graph. I think, at least? Including
a need to be able to correlate connections with their locks between the
nodes.

But for something like the bug at hand it'd probably be sufficient to just
"hack" something with dblink. In session 1) insert a row on the primary
using dblink, return the LSN, wait for the LSN to have replicated and
finally in session 2) check for row visibility.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-08 Thread Konstantin Knizhnik




On 07.09.2020 23:45, Andres Freund wrote:

Hi,


On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:

And which pgbench database scale factor have you used?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

I tried to reinitialize database with scale 200 but there was no
significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having 
four databases with separate pgbench tables etc. To see how much of it is
procarray contention, and how much it is contention of common buffers etc.

Sorry, I have tested the hypothesis that the difference in performance between
my case and yours can be explained by the size of the table, which can have an
influence on shared buffer contention.
That is why I used the same scale as you, but there is no difference
compared with scale 100.


And Postgres performance in this test is definitely limited by lock
contention (most likely shared buffer locks, rather than procarray locks).
If I create two instances of postgres, both with a pgbench -s 200 database,
and run two pgbenches with 100 connections each, then
each instance shows the same ~1 million TPS (1186483) as when launched
standalone. And the total TPS is 2.3 million.



Attachments:
* pgbench.svg

What numactl was used for this one?


None. I have not used numactl in this case.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Thomas Munro
On Tue, Sep 8, 2020 at 4:11 PM Andres Freund  wrote:
> At first I was very confused as to why none of the existing tests have
> found this significant issue. But after thinking about it for a minute
> that's because they all use psql, and largely separate psql invocations
> for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1].  Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...).  Example:

+# start an interactive session that we can use to interleave statements
+my $session = PsqlSession->new($node, "postgres");
+$session->send("\\set PROMPT1 ''\n", 2);
+$session->send("\\set PROMPT2 ''\n", 1);
...
+# our snapshot is not too old yet, so we can still use it
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "1");
...
+# our snapshot is too old!  the thing it wants to see has been removed
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "ERROR:  snapshot too old");

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BFkUuDv-bcBns%3DZ_O-V9QGW0nWZNHOkEPxHZWjegRXvw%40mail.gmail.com




Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Ian Barwick

On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

 *** a/src/backend/access/transam/xact.c
 --- b/src/backend/access/transam/xact.c
 *** xact_redo_commit(xl_xact_parsed_commit *
 *** 5915,5920 
 --- 5915,5924 
  */
 if (XactCompletionApplyFeedback(parsed->xinfo))
 XLogRequestWalReceiverReply();
 +
 +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 +   ShmemVariableCache->xactCompletionCount++;
 +   LWLockRelease(ProcArrayLock);
   }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).


Yup.


At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?


Sure, I'll give it a go as I have some time right now.


Regards

Ian Barwick

--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Andres Freund
Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:
> On 2020/09/03 17:18, Michael Paquier wrote:
> > On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > > So we get some builfarm results while thinking about this.
> > 
> > Andres, there is an entry in the CF for this thread:
> > https://commitfest.postgresql.org/29/2500/
> > 
> > A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> 
> I haven't seen it mentioned here, so apologies if I've overlooked
> something, but as of 623a9ba queries on standbys seem somewhat
> broken.
> 
> Specifically, I maintain some code which does something like this:
> 
> - connects to a standby
> - checks a particular row does not exist on the standby
> - connects to the primary
> - writes a row in the primary
> - polls the standby (using the same connection as above)
>   to verify the row arrives on the standby
> 
> As of recent HEAD it never sees the row arrive on the standby, even
> though it is verifiably there.

Ugh, that's not good.


> I've traced this back to 623a9ba, which relies on "xactCompletionCount"
> being incremented to determine whether the snapshot can be reused,
> but that never happens on a standby, meaning this test in
> GetSnapshotDataReuse():
> 
> if (curXactCompletionCount != snapshot->snapXactCompletionCount)
> return false;
> 
> will never return false, and the snapshot's xmin/xmax never get advanced.
> Which means the session on the standby is not able to see rows on the
> standby added after the session was started.
> 
> It's simple enough to prevent that being an issue by just never calling
> GetSnapshotDataReuse() if the snapshot was taken during recovery
> (though obviously that means any performance benefits won't be available
> on standbys).

Yea, that doesn't sound great. Nor is the additional branch welcome.


> I wonder if it's possible to increment "xactCompletionCount"
> during replay along these lines:
> 
> *** a/src/backend/access/transam/xact.c
> --- b/src/backend/access/transam/xact.c
> *** xact_redo_commit(xl_xact_parsed_commit *
> *** 5915,5920 
> --- 5915,5924 
>  */
> if (XactCompletionApplyFeedback(parsed->xinfo))
> XLogRequestWalReceiverReply();
> +
> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +   ShmemVariableCache->xactCompletionCount++;
> +   LWLockRelease(ProcArrayLock);
>   }
> 
> which seems to work (though quite possibly I've overlooked something I don't
> know that I don't know about and it will all break horribly somewhere,
> etc. etc.).

We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Ian Barwick

On 2020/09/03 17:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some builfarm results while thinking about this.


Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.


I haven't seen it mentioned here, so apologies if I've overlooked
something, but as of 623a9ba queries on standbys seem somewhat
broken.

Specifically, I maintain some code which does something like this:

- connects to a standby
- checks a particular row does not exist on the standby
- connects to the primary
- writes a row in the primary
- polls the standby (using the same connection as above)
  to verify the row arrives on the standby

As of recent HEAD it never sees the row arrive on the standby, even
though it is verifiably there.

I've traced this back to 623a9ba, which relies on "xactCompletionCount"
being incremented to determine whether the snapshot can be reused,
but that never happens on a standby, meaning this test in
GetSnapshotDataReuse():

if (curXactCompletionCount != snapshot->snapXactCompletionCount)
return false;

will never return false, and the snapshot's xmin/xmax never get advanced.
Which means the session on the standby is not able to see rows on the
standby added after the session was started.

It's simple enough to prevent that being an issue by just never calling
GetSnapshotDataReuse() if the snapshot was taken during recovery
(though obviously that means any performance benefits won't be available
on standbys).

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** xact_redo_commit(xl_xact_parsed_commit *
*** 5915,5920 
--- 5915,5924 
 */
if (XactCompletionApplyFeedback(parsed->xinfo))
XLogRequestWalReceiverReply();
+
+   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+   ShmemVariableCache->xactCompletionCount++;
+   LWLockRelease(ProcArrayLock);
  }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Andres Freund
Hi,


On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:
> >> And which pgbench database scale factor have you used?
> > 200
> >
> > Another thing you could try is to run 2-4 pgbench instances in different
> > databases.
> I tried to reinitialize database with scale 200 but there was no 
> significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having 
four databases with separate pgbench tables etc. To see how much of it is
procarray contention, and how much it is contention of common buffers etc.
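
Concretely, that could look roughly like the following (just a sketch;
database names, scale and client counts are arbitrary):

for db in pgb1 pgb2 pgb3 pgb4; do
    createdb $db
    pgbench -i -s 200 $db
done
# then run the instances in parallel, one per database:
for db in pgb1 pgb2 pgb3 pgb4; do
    pgbench -n -S -M prepared -c 25 -j 25 -T 60 $db &
done
wait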


> Attachments:
> * pgbench.svg

What numactl was used for this one?




Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Konstantin Knizhnik




On 06.09.2020 21:52, Andres Freund wrote:

Hi,

On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:

On 04.09.2020 21:53, Andres Freund wrote:

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

As far as I understand there should not be any TLB misses because the size of
the shared buffers (8Mb) is several orders of magnitude smaller than the
available physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to
be bigger than the TLB, particularly across processes (IIRC there's no
optimization to keep shared mappings de-duplicated between processes
from the view of the TLB).



Sorry, certainly 8Gb.
I tried huge pages, but it has almost no effect.




Re: Improving connection scalability: GetSnapshotData()

2020-09-06 Thread Andres Freund
Hi,

On 2020-09-06 14:05:35 +0300, Konstantin Knizhnik wrote:
> On 04.09.2020 21:53, Andres Freund wrote:
> > 
> > > May be it is because of more complex architecture of my server?
> > Think we'll need profiles to know...
> 
> This is "perf top" of pgebch -c 100 -j 100 -M prepared -S
> 
>   12.16%  postgres   [.] PinBuffer
>   11.92%  postgres   [.] LWLockAttemptLock
>    6.46%  postgres   [.] UnpinBuffer.constprop.11
>    6.03%  postgres   [.] LWLockRelease
>    3.14%  postgres   [.] BufferGetBlockNumber
>    3.04%  postgres   [.] ReadBuffer_common
>    2.13%  [kernel]   [k] _raw_spin_lock_irqsave
>    2.11%  [kernel]   [k] switch_mm_irqs_off
>    1.95%  postgres   [.] _bt_compare
> 
> 
> Looks like most of the time is spent in buffer locks.

Hm, that is interesting / odd. If you record a profile with call graphs
(e.g. --call-graph dwarf), where are all the LWLockAttemptLock calls
coming from?
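
For example, something along these lines (just a sketch; <backend_pid> is
whichever backend you want to sample, and without root you may need to lower
kernel.perf_event_paranoid first):

perf record --call-graph dwarf -o pgbench.perf.data -p <backend_pid> -- sleep 30
perf report -i pgbench.perf.data --no-children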


I assume the machine you're talking about is an 8 socket machine?

What if you:
a) start postgres and pgbench with numactl --interleave=all
b) start postgres with numactl --interleave=0,1 --cpunodebind=0,1 --membind=0,1
   in case you have 4 sockets, or 0,1,2,3 in case you have 8 sockets?


> And which pgbench database scale factor have you used?

200

Another thing you could try is to run 2-4 pgbench instances in different
databases.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-06 Thread Andres Freund
Hi,

On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:
> On 04.09.2020 21:53, Andres Freund wrote:
> > 
> > I also used huge_pages=on / configured them on the OS level. Otherwise
> > TLB misses will be a significant factor.
> 
> As far as I understand there should not be any TLB misses because the size of
> the shared buffers (8Mb) is several orders of magnitude smaller than the
> available physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to
be bigger than the TLB, particularly across processes (IIRC there's no
optimization to keep shared mappings de-duplicated between processes
from the view of the TLB).


> Yes, there is also noticeable difference in my case
> 
> | Idle Connections | Active Connections | TPS pre | TPS post |
> |-:|---:|:|-:|
> | 5000 | 48 |  758914 |  1184085 |

Sounds like you're somehow hitting another bottleneck around 1.2M TPS

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-06 Thread Konstantin Knizhnik




On 04.09.2020 21:53, Andres Freund wrote:



Maybe it is because of the more complex architecture of my server?

Think we'll need profiles to know...


This is "perf top" of pgbench -c 100 -j 100 -M prepared -S

  12.16%  postgres   [.] PinBuffer
  11.92%  postgres   [.] LWLockAttemptLock
   6.46%  postgres   [.] UnpinBuffer.constprop.11
   6.03%  postgres   [.] LWLockRelease
   3.14%  postgres   [.] BufferGetBlockNumber
   3.04%  postgres   [.] ReadBuffer_common
   2.13%  [kernel]   [k] _raw_spin_lock_irqsave
   2.11%  [kernel]   [k] switch_mm_irqs_off
   1.95%  postgres   [.] _bt_compare


Looks like most of the time is spent in buffer locks.
And which pgbench database scale factor have you used?




Re: Improving connection scalability: GetSnapshotData()

2020-09-05 Thread Konstantin Knizhnik




On 04.09.2020 21:53, Andres Freund wrote:


I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.


As far as I understand there should not be any TLB misses, because the 
size of the shared buffers (8Mb) is several orders of magnitude smaller 
than the available physical memory.


Does it change if you initialize the test database using
PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100
or run a manual VACUUM FREEZE; after initialization?

I tried it, but didn't see any improvement.



Hm, it'd probably be good to compare commits closer to the changes, to
avoid other changes showing up.

Hm - did you verify if all the connections were actually established?
Particularly without the patch applied? With an unmodified pgbench, I
sometimes saw better numbers, but only because only half the connections
were able to be established, due to ProcArrayLock contention.
Yes, that really happens quite often on the IBM Power2 server (a specific of 
its atomics implementation).
I even had to patch pgbench by adding a one second delay after a 
connection has been established, to make it possible for all clients to 
connect.
But on the Intel server I didn't see unconnected clients. And in any case it 
happens only for a large number of connections (> 1000).
But the best performance was achieved at about 100 connections, and still 
I cannot reach ~2M TPS as in your case.



Did you connect via tcp or unix socket? Was pgbench running on the same
machine? It was locally via unix socket for me (but it's also observable
via two machines, just with lower overall throughput).


Pgbench was launched on the same machine and connected through unix sockets.


Did you run a profile to see where the bottleneck is?

Sorry, I do not have root privileges on this server and so cannot use perf.


There's a separate benchmark that I found to be quite revealing that's
far less dependent on scheduler behaviour. Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections
(e.g. 100,1000,5000). That's to simulate connections that are
currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the
count in 1) increases. Afterwards it's pretty much linear.

I think this benchmark actually is much more real world oriented - due
to latency and client side overheads it's very normal to have a large
fraction of connections idle in read mostly OLTP workloads.
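
For concreteness, a rough sketch of such a run (file name, durations and
client counts are just examples, and max_connections obviously has to be
large enough for the idle connections):

  echo '\sleep 1s' > idle.sql

  # 1) many mostly-idle connections, left running in the background
  pgbench -n -f idle.sql -c 5000 -j 100 -T 120 postgres &

  # 2) the actual measurement: read-only, low client count
  pgbench -n -M prepared -S -c 48 -j 48 -T 60 postgres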

Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
connection scalability changes.

I used fairly short pgbench runs (15s), and the numbers are the best of
three runs. I also had emacs and mutt open - some noise to be
expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-:|---:|:|-:|
|0 |  1 |   33599 |33406 |
|  100 |  1 |   31088 |33279 |
| 1000 |  1 |   29377 |33434 |
| 2500 |  1 |   27050 |33149 |
| 5000 |  1 |   21895 |33903 |
|1 |  1 |   16034 |33140 |
|0 | 48 | 1042005 |  1125104 |
|  100 | 48 |  986731 |  1103584 |
| 1000 | 48 |  854230 |  1119043 |
| 2500 | 48 |  716624 |  1119353 |
| 5000 | 48 |  553657 |  1119476 |
|1 | 48 |  369845 |  1115740 |


Yes, there is also a noticeable difference in my case

| Idle Connections | Active Connections | TPS pre | TPS post |
|-:|---:|:|-:|
| 5000 | 48 |  758914 |  1184085 |


Think we'll need profiles to know...


I will try to obtain sudo permissions and do profiling.




Re: Improving connection scalability: GetSnapshotData()

2020-09-04 Thread Michael Paquier
On Fri, Sep 04, 2020 at 10:35:19AM -0700, Andres Freund wrote:
> I think it's best to close the entry. There's substantial further wins
> possible, in particular not acquiring ProcArrayLock in GetSnapshotData()
> when the cache is valid improves performance substantially. But it's
> non-trivial enough that it's probably best dealth with in a separate
> patch / CF entry.

Cool, thanks for updating the CF entry.
--
Michael




Re: Improving connection scalability: GetSnapshotData()

2020-09-04 Thread Andres Freund
On 2020-09-04 11:53:04 -0700, Andres Freund wrote:
> There's a seperate benchmark that I found to be quite revealing that's
> far less dependent on scheduler behaviour. Run two pgbench instances:
> 
> 1) With a very simply script '\sleep 1s' or such, and many connections
>(e.g. 100,1000,5000). That's to simulate connections that are
>currently idle.
> 2) With a normal pgbench read only script, and low client counts.
> 
> Before the changes 2) shows a very sharp decline in performance when the
> count in 1) increases. Afterwards its pretty much linear.
> 
> I think this benchmark actually is much more real world oriented - due
> to latency and client side overheads it's very normal to have a large
> fraction of connections idle in read mostly OLTP workloads.
> 
> Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
> 1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
> 07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
> connection scalability changes.
> 
> I used fairly short pgbench runs (15s), and the numbers are the best of
> three runs. I also had emacs and mutt open - some noise to be
> expected. But I also gotta work ;)
> 
> | Idle Connections | Active Connections | TPS pre | TPS post |
> |-:|---:|:|-:|
> |0 |  1 |   33599 |33406 |
> |  100 |  1 |   31088 |33279 |
> | 1000 |  1 |   29377 |33434 |
> | 2500 |  1 |   27050 |33149 |
> | 5000 |  1 |   21895 |33903 |
> |1 |  1 |   16034 |33140 |
> |0 | 48 | 1042005 |  1125104 |
> |  100 | 48 |  986731 |  1103584 |
> | 1000 | 48 |  854230 |  1119043 |
> | 2500 | 48 |  716624 |  1119353 |
> | 5000 | 48 |  553657 |  1119476 |
> |1 | 48 |  369845 |  1115740 |

Attached in graph form.

Greetings,

Andres Freund


Re: Improving connection scalability: GetSnapshotData()

2020-09-04 Thread Andres Freund
Hi,

On 2020-09-04 18:24:12 +0300, Konstantin Knizhnik wrote:
> Reported results looks very impressive.
> But I tried to reproduce them and didn't observed similar behavior.
> So I am wondering what can be the difference and what I am doing wrong.

That is odd - I did reproduce it on quite a few systems by now.


> Configuration file has the following differences with default postgres config:
> 
> max_connections = 1   # (change requires restart)
> shared_buffers = 8GB  # min 128kB

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.
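
Roughly (the page count here is just an example sized for 8GB of
shared_buffers plus some slop; the right number depends on the system):

  # OS level: reserve 2MB huge pages
  sysctl -w vm.nr_hugepages=4500

  # postgresql.conf
  huge_pages = on
  shared_buffers = 8GB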

Does it change if you initialize the test database using
PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100
or run a manual VACUUM FREEZE; after initialization?


> I have tried two different systems.
> First one is IBM Power2 server with 384 cores and 8Tb of RAM.
> I run the same read-only pgbench test as you. I do not think that size of the 
> database is matter, so I used scale 100 -
> it seems to be enough to avoid frequent buffer conflicts.
> Then I run the same scripts as you:
>
>  for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 
> 100 -j $n -M prepared -S -n postgres ;  done
>  for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 
> 100 -j $n -M prepared -S -n postgres ;  done
>
>
> I have compared current master with version of Postgres prior to your commits 
> with scalability improvements: a9a4a7ad56

Hm, it'd probably be good to compare commits closer to the changes, to
avoid other changes showing up.

Hm - did you verify if all the connections were actually established?
Particularly without the patch applied? With an unmodified pgbench, I
sometimes saw better numbers, but only because only half the connections
were able to be established, due to ProcArrayLock contention.
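
(One simple way to check that during a run is to count backends server
side, e.g. something like

  SELECT count(*) FROM pg_stat_activity WHERE application_name = 'pgbench';

which should roughly match the -c value once all clients are connected,
assuming application_name wasn't overridden.)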

See 
https://www.postgresql.org/message-id/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de

There also is the issue that pgbench numbers for inclusive/exclusive are
just about meaningless right now:
https://www.postgresql.org/message-id/20200227202636.qaf7o6qcajsudoor%40alap3.anarazel.de
(reminds me, need to get that fixed)


One more thing worth investigating is whether your results change
significantly when you start the server using
numactl --interleave=all .
Especially on larger systems the results otherwise can vary a lot from
run-to-run, because the placement of shared buffers matters a lot.


> So I have repeated experiments at Intel server.
> It has 160 cores Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and 256Gb of RAM.
>
> The same database, the same script, results are the following:
>
> Clients   old/inc old/exl new/inc new/exl
> 1000  1105750 1163292 1206105 1212701
> 2000  1050933 1124688 1149706 1164942
> 3000  1063667 1195158 1118087 1144216
> 4000  1040065 1290432 1107348 1163906
> 5000  943813  1258643 1103790 1160251

> I have separately show results including/excluding connection connections 
> establishing,
> because in new version there are almost no differences between them,
> but for old version gap between them is noticeable.
>
> Configuration file has the following differences with default postgres config:
>
> max_connections = 1   # (change requires restart)
> shared_buffers = 8GB  # min 128kB

>
> This results contradict with yours and makes me ask the following questions:

> 1. Why in your case performance is almost two times larger (2 millions vs 1)?
> The hardware in my case seems to be at least not worser than yours...
> May be there are some other improvements in the version you have tested which 
> are not yet committed to master?

No, no uncommitted changes, except for the pgbench stuff mentioned
above. However I found that the kernel version matters a fair bit, it's
pretty easy to run into kernel scalability issues in a workload that is
this heavy scheduler dependent.

Did you connect via tcp or unix socket? Was pgbench running on the same
machine? It was locally via unix socket for me (but it's also observable
via two machines, just with lower overall throughput).

Did you run a profile to see where the bottleneck is?


There's a separate benchmark that I found to be quite revealing that's
far less dependent on scheduler behaviour. Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections
   (e.g. 100,1000,5000). That's to simulate connections that are
   currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the
count in 1) increases. Afterwards it's pretty much linear.

I think this benchmark actually is much more real world oriented - due
to latency and client side overheads it's very normal 

Re: Improving connection scalability: GetSnapshotData()

2020-09-04 Thread Andres Freund
Hi,

On 2020-09-03 17:18:29 +0900, Michael Paquier wrote:
> On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > So we get some builfarm results while thinking about this.
> 
> Andres, there is an entry in the CF for this thread:
> https://commitfest.postgresql.org/29/2500/
> 
> A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> Now that PGXACT is done, how much work is remaining here?

I think it's best to close the entry. There's substantial further wins
possible, in particular not acquiring ProcArrayLock in GetSnapshotData()
when the cache is valid improves performance substantially. But it's
non-trivial enough that it's probably best dealth with in a separate
patch / CF entry.

Closed.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-09-04 Thread Konstantin Knizhnik



On 03.09.2020 11:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some builfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?
--
Michael


Andres,
First of all a lot of thanks for this work.
Improving Postgres connection scalability is very important.

The reported results look very impressive.
But I tried to reproduce them and didn't observe similar behavior.
So I am wondering what the difference can be and what I am doing wrong.

I have tried two different systems.
First one is IBM Power2 server with 384 cores and 8Tb of RAM.
I ran the same read-only pgbench test as you. I do not think that the size of the 
database matters, so I used scale 100 -
it seems to be enough to avoid frequent buffer conflicts.
Then I run the same scripts as you:

 for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 
-j $n -M prepared -S -n postgres ;  done
 for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 
100 -j $n -M prepared -S -n postgres ;  done


I have compared current master with version of Postgres prior to your commits 
with scalability improvements: a9a4a7ad56

For all numbers of connections the older version shows slightly better results; for 
example, for 500 clients: 475k TPS vs. 450k TPS for current master.

This is a quite exotic server and I do not currently have access to it.
So I have repeated the experiments on an Intel server.
It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256Gb of RAM.

The same database, the same script, results are the following:

Clients    old/inc    old/exl    new/inc    new/exl
   1000    1105750    1163292    1206105    1212701
   2000    1050933    1124688    1149706    1164942
   3000    1063667    1195158    1118087    1144216
   4000    1040065    1290432    1107348    1163906
   5000     943813    1258643    1103790    1160251

I have separately shown results including/excluding connection 
establishment,
because in the new version there is almost no difference between them,
but for the old version the gap between them is noticeable.

Configuration file has the following differences with default postgres config:

max_connections = 1 # (change requires restart)
shared_buffers = 8GB# min 128kB


These results contradict yours and make me ask the following questions:

1. Why is the performance in your case almost two times larger (2 million vs. 1)?
The hardware in my case seems to be at least no worse than yours...
Maybe there are some other improvements in the version you have tested which 
are not yet committed to master?

2. You wrote: This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 
threads)

According to the Intel specification, the Intel® Xeon® Platinum 8168 Processor has 24 
cores:
https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html

And on your graph we can see an almost linear increase in speed up to 40 
connections.

But the most suspicious word for me is "virtualized". What is the actual hardware, 
and how is it virtualized?

Do you have any idea why in my case the master version (with your commits) behaves 
almost the same as the non-patched version?
Below is yet another table showing scalability from 10 to 100 connections, 
combining your results (first two columns) and my results (last two columns):


Clients   old master   pgxact-split-cache   current master   revision a9a4a7ad56
     10       367883               375682           358984                347067
     20       748000               810964           668631                630304
     30       999231              1288276           920255                848244
     40       991672              1573310          1100745                970717
     50      1017561              1715762          1193928               1008755
     60       993943              1789698          1255629                917788
     70       971379              1819477          1277634                873022
     80       966276              1842248          1266523                830197
     90       901175              1847823          1255260                736550
    100       803175              1865795          1241143                736756


Maybe it is because of the more complex architecture of my server?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Improving connection scalability: GetSnapshotData()

2020-09-03 Thread Michael Paquier
On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> So we get some builfarm results while thinking about this.

Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
Now that PGXACT is done, how much work is remaining here?
--
Michael




Re: Improving connection scalability: GetSnapshotData()

2020-08-17 Thread Alvaro Herrera
On 2020-Aug-16, Peter Geoghegan wrote:

> On Sun, Aug 16, 2020 at 2:11 PM Andres Freund  wrote:
> > For the first, one issue is that there's no obviously good candidate for
> > an uninitialized xid. We could use something like FrozenTransactionId,
> > which may never be in the procarray. But it's not exactly pretty.
> 
> Maybe it would make sense to mark the fields as inaccessible or
> undefined to Valgrind. That has advantages and disadvantages that are
> obvious.

... and perhaps making Valgrind complain about it is sufficient.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 17:28:46 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > For the first, one issue is that there's no obviously good candidate for
> > an uninitialized xid. We could use something like FrozenTransactionId,
> > which may never be in the procarray. But it's not exactly pretty.
> 
> Huh?  What's wrong with using InvalidTransactionId?

It's a normal value for a backend when it doesn't have an xid assigned.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
Andres Freund  writes:
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.

Huh?  What's wrong with using InvalidTransactionId?

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Peter Geoghegan
On Sun, Aug 16, 2020 at 2:11 PM Andres Freund  wrote:
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.

Maybe it would make sense to mark the fields as inaccessible or
undefined to Valgrind. That has advantages and disadvantages that are
obvious.

If that isn't enough, it might not hurt to do this on top of whatever
becomes the primary solution. An undefined value has the advantage of
"spreading" when the value gets copied around.

-- 
Peter Geoghegan




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 14:11:46 -0700, Andres Freund wrote:
> On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> > On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > Gna, I think I see the problem.  In at least one place I wrongly
> > accessed the 'dense' array of in-progress xids using the 'pgprocno',
> > instead of directly using the [0...procArray->numProcs) index.
> > 
> > Working on a fix, together with some improved asserts.
> 
> diff --git i/src/backend/storage/ipc/procarray.c 
> w/src/backend/storage/ipc/procarray.c
> index 8262abd42e6..96e4a878576 100644
> --- i/src/backend/storage/ipc/procarray.c
> +++ w/src/backend/storage/ipc/procarray.c
> @@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
>  TransactionId xmin;
>  
>  /* Fetch xid just once - see GetNewTransactionId */
> -xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
> +xid = UINT32_ACCESS_ONCE(other_xids[index]);
>  xmin = UINT32_ACCESS_ONCE(proc->xmin);
>  
>  /*
> 
> indeed fixes the issue based on a number of iterations of your modified
> test, and fixes a clear bug.

Pushed that much.


> WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
> initialize currently unused procArray->pgprocnos,
> procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
> declare them as uninitialized using the valgrind helpers.
> 
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.
> 
> Opinions?

So we get some builfarm results while thinking about this.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > I now luckily have a rr trace of the problem, so I hope I can narrow it
> > down to the original problem fairly quickly.
> 
> Gna, I think I see the problem.  In at least one place I wrongly
> accessed the 'dense' array of in-progress xids using the 'pgprocno',
> instead of directly using the [0...procArray->numProcs) index.
> 
> Working on a fix, together with some improved asserts.

diff --git i/src/backend/storage/ipc/procarray.c 
w/src/backend/storage/ipc/procarray.c
index 8262abd42e6..96e4a878576 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 TransactionId xmin;
 
 /* Fetch xid just once - see GetNewTransactionId */
-xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
+xid = UINT32_ACCESS_ONCE(other_xids[index]);
 xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 /*

indeed fixes the issue based on a number of iterations of your modified
test, and fixes a clear bug.
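
To spell out the invariant that the buggy line violated (a simplified,
self-contained sketch with stand-in types, not the real declarations): the
dense arrays are parallel to procArray->pgprocnos and therefore
subscripted with the loop index, while PGPROC itself is addressed via the
pgprocno stored there.

#include <stdint.h>

typedef uint32_t TransactionId;
typedef struct PGPROC { TransactionId xmin; } PGPROC;

typedef struct ProcArrayStruct
{
    int numProcs;
    int pgprocnos[1024];        /* dense index -> PGPROC slot number */
} ProcArrayStruct;

static void
walk_procarray(ProcArrayStruct *procArray,
               const TransactionId *other_xids,    /* dense array */
               const PGPROC *allProcs)             /* per PGPROC slot */
{
    for (int index = 0; index < procArray->numProcs; index++)
    {
        int            pgprocno = procArray->pgprocnos[index];

        /* dense array: use 'index'; PGPROC fields: use 'pgprocno' */
        TransactionId xid = other_xids[index];
        TransactionId xmin = allProcs[pgprocno].xmin;

        (void) xid;
        (void) xmin;
        /* horizon computations would go here */
    }
}

Subscripting other_xids with pgprocno instead reads a slot that belongs to
a different backend, or to no backend at all.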

WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
initialize currently unused procArray->pgprocnos,
procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
declare them as uninitialized using the valgrind helpers.
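
For the valgrind side of that, the idea would be roughly this (a sketch
with stand-in types and names, not the real procarray code; the wrappers
for these client requests live in utils/memdebug.h):

#include <stdint.h>
#ifdef USE_VALGRIND
#include <valgrind/memcheck.h>
#else
#define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) ((void) 0)
#endif

typedef uint32_t TransactionId;

/*
 * Called when a slot stops being used: any later read of the xid without
 * re-initializing it gets flagged by memcheck, and the "undefined" state
 * propagates into copies of the value. VALGRIND_MAKE_MEM_NOACCESS would
 * be the "inaccessible" variant instead.
 */
static void
mark_xid_slot_unused(TransactionId *xids, int index)
{
    VALGRIND_MAKE_MEM_UNDEFINED(&xids[index], sizeof(xids[index]));
}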

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Opinions?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> I now luckily have a rr trace of the problem, so I hope I can narrow it
> down to the original problem fairly quickly.

Gna, I think I see the problem.  In at least one place I wrongly
accessed the 'dense' array of in-progress xids using the 'pgprocno',
instead of directly using the [0...procArray->numProcs) index.

Working on a fix, together with some improved asserts.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 16:17:23 -0400, Tom Lane wrote:
> I wrote:
> > It seems entirely likely that there's a timing component in this, for
> > instance autovacuum coming along at just the right time.
> 
> D'oh.  The attached seems to make it 100% reproducible.

Great!  Interestingly it didn't work as the first item on the schedule,
where I had duplicated it out of impatience. I guess there might
be some need for concurrent autovacuum activity or something like that.

I now luckily have a rr trace of the problem, so I hope I can narrow it
down to the original problem fairly quickly.

Thanks,

Andres




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
I wrote:
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time.

D'oh.  The attached seems to make it 100% reproducible.

regards, tom lane

diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec
index 915bf15b92..4100d9fc6f 100644
--- a/src/test/isolation/specs/freeze-the-dead.spec
+++ b/src/test/isolation/specs/freeze-the-dead.spec
@@ -32,6 +32,7 @@ session "s2"
 step "s2_begin"		{ BEGIN; }
 step "s2_key_share"	{ SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; }
 step "s2_commit"	{ COMMIT; }
+step "s2_wait"		{ select pg_sleep(60); }
 step "s2_vacuum"	{ VACUUM FREEZE tab_freeze; }
 
 session "s3"
@@ -49,6 +50,7 @@ permutation "s1_begin" "s2_begin" "s3_begin" # start transactions
"s1_update" "s2_key_share" "s3_key_share" # have xmax be a multi with an updater, updater being oldest xid
"s1_update" # create additional row version that has multis
"s1_commit" "s2_commit" # commit both updater and share locker
+   "s2_wait"
"s2_vacuum" # due to bug in freezing logic, we used to *not* prune updated row, and then froze it
"s1_selectone" # if hot chain is broken, the row can't be found via index scan
"s3_commit" # commit remaining open xact


Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
On 2020-08-16 14:30:24 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > 690 successful runs later, it didn't trigger for me :(. Seems pretty
> > clear that there's another variable than pure chance, otherwise it seems
> > like that number of runs should have hit the issue, given the number of
> > bf hits vs bf runs.
> 
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time.  It's not too
> surprising that some machines would be more prone to show that than
> others.  (Note peripatus is FreeBSD, which we've already learned has
> significantly different kernel scheduler behavior than Linux.)

Yea. Interestingly there was a reproduction on linux since the initial
reports you'd dug up:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=butterflyfish&dt=2020-08-15%2019%3A54%3A53

but that's likely a virtualized environment, so I guess the host
scheduler behaviour could play a similar role.

I'll run a few iterations with rr's chaos mode too, which tries to
randomize scheduling decisions...


I noticed that it's quite hard to actually hit the hot tuple path I
mentioned earlier on my machine. It would probably be good to have a test
hitting it more reliably. But I'm not immediately seeing how we could
force the necessary serialization.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
Andres Freund  writes:
> 690 successful runs later, it didn't trigger for me :(. Seems pretty
> clear that there's another variable than pure chance, otherwise it seems
> like that number of runs should have hit the issue, given the number of
> bf hits vs bf runs.

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time.  It's not too
surprising that some machines would be more prone to show that than
others.  (Note peripatus is FreeBSD, which we've already learned has
significantly different kernel scheduler behavior than Linux.)

> My current plan would is to push a bit of additional instrumentation to
> help narrow down the issue.

+1

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-15 09:42:00 -0700, Andres Freund wrote:
> On 2020-08-15 11:10:51 -0400, Tom Lane wrote:
> > We have two essentially identical buildfarm failures since these patches
> > went in:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14
> >
> > They're both in the same place in the freeze-the-dead isolation test:
> 
> > TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", 
> > File: "heapam.c", Line: 6051)
> > 0x9613eb  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x52d586  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x53bc7e  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x6949bb  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x694532  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x693d1c  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x8324b3
> > ...
> > 2020-08-14 22:16:41.783 CDT [78410:4] LOG:  server process (PID 80395) was 
> > terminated by signal 6: Abort trap
> > 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL:  Failed process was running: 
> > VACUUM FREEZE tab_freeze;
> >
> > peripatus has successes since this failure, so it's not fully reproducible
> > on that machine.  I'm suspicious of a timing problem in computing vacuum's
> > cutoff_xid.
> 
> Hm, maybe it's something around what I observed in
> https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de
> 
> I.e. that somehow we end up with hot pruning and freezing coming to a
> different determination, and trying to freeze a hot tuple.
> 
> I'll try to add a few additional asserts here, and burn some cpu tests
> trying to trigger the issue.
> 
> I gotta escape the heat in the house for a few hours though (no AC
> here), so I'll not look at the results till later this afternoon, unless
> it triggers soon.

690 successful runs later, it didn't trigger for me :(. Seems pretty
clear that there's some variable other than pure chance at play; otherwise it seems
like that number of runs should have hit the issue, given the number of
bf hits vs bf runs.

My current plan is to push a bit of additional instrumentation to
help narrow down the issue. We can afterwards decide what of that we'd
like to keep longer term, and what not.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-15 Thread Andres Freund
Hi,

On 2020-08-15 11:10:51 -0400, Tom Lane wrote:
> We have two essentially identical buildfarm failures since these patches
> went in:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14
>
> They're both in the same place in the freeze-the-dead isolation test:

> TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", 
> File: "heapam.c", Line: 6051)
> 0x9613eb  at 
> /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x52d586  at 
> /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x53bc7e  at 
> /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x6949bb  at 
> /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x694532  at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x693d1c  at 
> /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> 0x8324b3
> ...
> 2020-08-14 22:16:41.783 CDT [78410:4] LOG:  server process (PID 80395) was 
> terminated by signal 6: Abort trap
> 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL:  Failed process was running: 
> VACUUM FREEZE tab_freeze;
>
> peripatus has successes since this failure, so it's not fully reproducible
> on that machine.  I'm suspicious of a timing problem in computing vacuum's
> cutoff_xid.

Hm, maybe it's something around what I observed in
https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de

I.e. that somehow we end up with hot pruning and freezing coming to a
different determination, and trying to freeze a hot tuple.

I'll try to add a few additional asserts here, and burn some cpu tests
trying to trigger the issue.

I gotta escape the heat in the house for a few hours though (no AC
here), so I'll not look at the results till later this afternoon, unless
it triggers soon.


> (I'm also wondering why the failing check is an Assert rather than a real
> test-and-elog.  Assert doesn't seem like an appropriate way to check for
> plausible data corruption cases.)

Robert, and to a lesser degree you, gave me quite a bit of grief over
converting nearby asserts to elogs. I agree it'd be better if it were
an assert, but ...

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-15 Thread Tom Lane
We have two essentially identical buildfarm failures since these patches
went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14

They're both in the same place in the freeze-the-dead isolation test:

TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", 
File: "heapam.c", Line: 6051)
0x9613eb  at 
/home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x52d586  at 
/home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x53bc7e  at 
/home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x6949bb  at 
/home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x694532  at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x693d1c  at 
/home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
0x8324b3
...
2020-08-14 22:16:41.783 CDT [78410:4] LOG:  server process (PID 80395) was 
terminated by signal 6: Abort trap
2020-08-14 22:16:41.783 CDT [78410:5] DETAIL:  Failed process was running: 
VACUUM FREEZE tab_freeze;

peripatus has successes since this failure, so it's not fully reproducible
on that machine.  I'm suspicious of a timing problem in computing vacuum's
cutoff_xid.

(I'm also wondering why the failing check is an Assert rather than a real
test-and-elog.  Assert doesn't seem like an appropriate way to check for
plausible data corruption cases.)

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-08-11 Thread Thomas Munro
On Wed, Aug 12, 2020 at 12:19 PM Andres Freund  wrote:
> On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:
> > On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro  wrote:
> > > +static inline FullTransactionId
> > > +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> > >
> > > I'm struggling to find a better word for this than "relative".
> >
> > The best I've got is "anchor" xid.  It is an xid that is known to
> > limit nextFullXid's range while the receiving function runs.
>
> Thinking about it, I think that relative is a good descriptor. It's just
> that 'via' is weird. How about: FullXidRelativeTo?

WFM.




Re: Improving connection scalability: GetSnapshotData()

2020-08-11 Thread Andres Freund
Hi,

On 2020-07-29 19:20:04 +1200, Thomas Munro wrote:
> On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro  wrote:
> > +static inline FullTransactionId
> > +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
> >
> > I'm struggling to find a better word for this than "relative".
> 
> The best I've got is "anchor" xid.  It is an xid that is known to
> limit nextFullXid's range while the receiving function runs.

Thinking about it, I think that relative is a good descriptor. It's just
that 'via' is weird. How about: FullXidRelativeTo?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-07-31 Thread Daniel Gustafsson
> On 24 Jul 2020, at 03:11, Andres Freund  wrote:

> I've done that in the attached.

As this is actively being reviewed but time is running short, I'm moving this
to the next CF.

cheers ./daniel




Re: Improving connection scalability: GetSnapshotData()

2020-07-29 Thread Thomas Munro
On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro  wrote:
> +static inline FullTransactionId
> +FullXidViaRelative(FullTransactionId rel, TransactionId xid)
>
> I'm struggling to find a better word for this than "relative".

The best I've got is "anchor" xid.  It is an xid that is known to
limit nextFullXid's range while the receiving function runs.




Re: Improving connection scalability: GetSnapshotData()

2020-07-29 Thread Thomas Munro
On Fri, Jul 24, 2020 at 1:11 PM Andres Freund  wrote:
> On 2020-07-15 21:33:06 -0400, Alvaro Herrera wrote:
> > On 2020-Jul-15, Andres Freund wrote:
> > > It could make sense to split the conversion of
> > > VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
> > > into is own commit. Not sure...
> >
> > +1, the commit is large enough and that change can be had in advance.
>
> I've done that in the attached.

+ * pair with the memory barrier below.  We do however accept xid to be <=
+ * to next_xid, instead of just <, as xid could be from the procarray,
+ * before we see the updated nextFullXid value.

Tricky.  Right, that makes sense.  I like the range assertion.

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

+return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
++ (int32) (xid - rel_xid));

I like your branch-free code for this.
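
To spell out why the branch-free form works (a standalone sketch with
stand-in types, not the actual definitions): the signed 32-bit difference
between xid and the anchor's low word is the distance modulo 2^32, and
adding it to the full 64-bit anchor recovers the epoch, provided xid lies
within 2^31 of the anchor - which the range assertion is there to ensure.

#include <stdint.h>

typedef uint32_t TransactionId;
typedef struct { uint64_t value; } FullTransactionId;   /* stand-in */

static inline FullTransactionId
full_xid_relative_to(FullTransactionId rel, TransactionId xid)
{
    TransactionId     rel_xid = (TransactionId) rel.value;  /* low 32 bits */
    FullTransactionId result;

    /* signed wraparound-aware distance, then widen and add */
    result.value = rel.value + (int64_t) (int32_t) (xid - rel_xid);
    return result;
}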

> I wonder if somebody has an opinion on renaming latestCompletedXid to
> latestCompletedFullXid. That's the pattern we already had (cf
> nextFullXid), but it also leads to pretty long lines and quite a few
> comment etc changes.
>
> I'm somewhat inclined to remove the "Full" out of the variable, and to
> also do that for nextFullXid. I feel like including it in the variable
> name is basically a poor copy of the (also not great) C type system.  If
> we hadn't made FullTransactionId a struct I'd see it differently (and
> thus incompatible with TransactionId), but we have ...

Yeah, I'm OK with dropping the "Full".  I've found it rather clumsy too.




Re: Improving connection scalability: GetSnapshotData()

2020-07-25 Thread Ranier Vilela
On Fri, Jul 24, 2020 at 21:00, Andres Freund 
wrote:

> On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> > Em sex., 24 de jul. de 2020 às 14:16, Andres Freund 
> > escreveu:
> >
> > > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > > Latest Postgres
> > > > Windows 64 bits
> > > > msvc 2019 64 bits
> > > >
> > > > Patches applied v12-0001 to v12-0007:
> > > >
> > > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning
> > > C4013:
> > > > 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > > [C:\dll\postgres
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29):
> warning
> > > > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > > [C:\dll\postgres\pg_visibility.
> > > > vcxproj]
> > > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error
> C2065:
> > > > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > > [C:\dll\postgres\pgstattuple.vcxproj]
> > > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58):
> error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70):
> error
> > > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > > [C:\dll\postgres\pg_visibility.vcxproj]
> > >
> > > I don't know that's about - there's no call to GetOldestXmin() in
> > > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > > PROCARRAY_* references are also removed in the same patch?
> > >
> > Maybe need to remove them from these places, not?
> > C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> > File pgstattuple\pgstatapprox.c:
> > OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> > File pg_visibility\pg_visibility.c:
> > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> >  * deadlocks, because surely
> > GetOldestXmin() should never take
> > RecomputedOldestXmin =
> GetOldestXmin(NULL,
> > PROCARRAY_FLAGS_VACUUM);
>
> The 0002 patch changed those files:
>
> diff --git a/contrib/pg_visibility/pg_visibility.c
> b/contrib/pg_visibility/pg_visibility.c
> index 68d580ed1e0..37206c50a21 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible,
> bool all_frozen)
> BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
> TransactionId OldestXmin = InvalidTransactionId;
>
> -   if (all_visible)
> -   {
> -   /* Don't pass rel; that will fail in recovery. */
> -   OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
> -   }
> -
> rel = relation_open(relid, AccessShareLock);
>
> /* Only some relkinds have a visibility map */
> check_relation_relkind(rel);
>
> +   if (all_visible)
> +   OldestXmin = GetOldestNonRemovableTransactionId(rel);
> +
> nblocks = RelationGetNumberOfBlocks(rel);
>
> /*
> @@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible,
> bool all_frozen)
>  * From a concurrency point of view, it
> sort of sucks to
>  * retake ProcArrayLock here while we're
> holding the buffer
>  * exclusively locked, but it should be
> safe against
> -* deadlocks, because surely
> GetOldestXmin() should never take
> -* a buffer lock. And this shouldn't
> happen often, so it's
> -* worth being careful so as to avoid
> false positives.
> +* deadlocks, because surely
> GetOldestNonRemovableTransactionId()
> +* should never take a buffer lock. And
> this shouldn't happen
> +* often, so it's worth being careful so
> as to avoid false
> +* positives.
>  */
> -   RecomputedOldestXmin = GetOldestXmin(NULL,
> PROCARRAY_FLAGS_VACUUM);
> +   RecomputedOldestXmin =
> GetOldestNonRemovableTransactionId(rel);
>
> if (!TransactionIdPrecedes(OldestXmin,
> RecomputedOldestXmin))
> record_corrupt_item(items,
> _self);
>
> diff --git a/contrib/pgstattuple/pgstatapprox.c
> b/contrib/pgstattuple/pgstatapprox.c
> index dbc0fa11f61..3a99333d443 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
> BufferAccessStrategy bstrategy;
> TransactionId OldestXmin;
>
> -   OldestXmin = GetOldestXmin(rel, 

Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Andres Freund
On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> Em sex., 24 de jul. de 2020 às 14:16, Andres Freund 
> escreveu:
> 
> > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > Latest Postgres
> > > Windows 64 bits
> > > msvc 2019 64 bits
> > >
> > > Patches applied v12-0001 to v12-0007:
> > >
> > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning
> > C4013:
> > > 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > [C:\dll\postgres
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> > > [C:\dll\postgres\pg_visibility.
> > > vcxproj]
> > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pgstattuple.vcxproj]
> > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> >
> > I don't know that's about - there's no call to GetOldestXmin() in
> > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > PROCARRAY_* references are also removed in the same patch?
> >
> Maybe need to remove them from these places, not?
> C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> File pgstattuple\pgstatapprox.c:
> OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> File pg_visibility\pg_visibility.c:
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
>  * deadlocks, because surely
> GetOldestXmin() should never take
> RecomputedOldestXmin = GetOldestXmin(NULL,
> PROCARRAY_FLAGS_VACUUM);

The 0002 patch changed those files:

diff --git a/contrib/pg_visibility/pg_visibility.c 
b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool 
all_frozen)
BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
TransactionId OldestXmin = InvalidTransactionId;
 
-   if (all_visible)
-   {
-   /* Don't pass rel; that will fail in recovery. */
-   OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-   }
-
rel = relation_open(relid, AccessShareLock);
 
/* Only some relkinds have a visibility map */
check_relation_relkind(rel);
 
+   if (all_visible)
+   OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
nblocks = RelationGetNumberOfBlocks(rel);
 
/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool 
all_frozen)
 * From a concurrency point of view, it sort of 
sucks to
 * retake ProcArrayLock here while we're 
holding the buffer
 * exclusively locked, but it should be safe 
against
-* deadlocks, because surely GetOldestXmin() 
should never take
-* a buffer lock. And this shouldn't happen 
often, so it's
-* worth being careful so as to avoid false 
positives.
+* deadlocks, because surely 
GetOldestNonRemovableTransactionId()
+* should never take a buffer lock. And this 
shouldn't happen
+* often, so it's worth being careful so as to 
avoid false
+* positives.
 */
-   RecomputedOldestXmin = GetOldestXmin(NULL, 
PROCARRAY_FLAGS_VACUUM);
+   RecomputedOldestXmin = 
GetOldestNonRemovableTransactionId(rel);
 
if (!TransactionIdPrecedes(OldestXmin, 
RecomputedOldestXmin))
record_corrupt_item(items, 
_self);

diff --git a/contrib/pgstattuple/pgstatapprox.c 
b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
BufferAccessStrategy bstrategy;
TransactionId OldestXmin;
 
-   OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+   OldestXmin = GetOldestNonRemovableTransactionId(rel);
bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
nblocks = RelationGetNumberOfBlocks(rel);


Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Ranier Vilela
On Fri, Jul 24, 2020 at 14:16, Andres Freund 
wrote:

> On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > Latest Postgres
> > Windows 64 bits
> > msvc 2019 64 bits
> >
> > Patches applied v12-0001 to v12-0007:
> >
> >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning
> C4013:
> > 'GetOldestXmin' indefinido; assumindo extern retornando int
> > [C:\dll\postgres
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> > [C:\dll\postgres\pg_visibility.
> > vcxproj]
> >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pgstattuple.vcxproj]
> >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pg_visibility.vcxproj]
> >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> > [C:\dll\postgres\pg_visibility.vcxproj]
>
> I don't know that's about - there's no call to GetOldestXmin() in
> pgstatapprox and pg_visibility after patch 0002? And similarly, the
> PROCARRAY_* references are also removed in the same patch?
>
Maybe they need to be removed from these places, no?
C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
 * deadlocks, because surely
GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL,
PROCARRAY_FLAGS_VACUUM);

regards,
Ranier Vilela


Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Andres Freund
On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> Latest Postgres
> Windows 64 bits
> msvc 2019 64 bits
> 
> Patches applied v12-0001 to v12-0007:
> 
>  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> 'GetOldestXmin' indefinido; assumindo extern retornando int
> [C:\dll\postgres
> C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
> [C:\dll\postgres\pg_visibility.
> vcxproj]
>  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> [C:\dll\postgres\pgstattuple.vcxproj]
>   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> [C:\dll\postgres\pg_visibility.vcxproj]
>   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
> [C:\dll\postgres\pg_visibility.vcxproj]

I don't know that's about - there's no call to GetOldestXmin() in
pgstatapprox and pg_visibility after patch 0002? And similarly, the
PROCARRAY_* references are also removed in the same patch?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Ranier Vilela
Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

 C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
'GetOldestXmin' indefinido; assumindo extern retornando int
[C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
C4013: 'GetOldestXmin' indefinido; assumindo extern retornando int
[C:\dll\postgres\pg_visibility.
vcxproj]
 C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
[C:\dll\postgres\pgstattuple.vcxproj]
  C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
[C:\dll\postgres\pg_visibility.vcxproj]
  C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
C2065: 'PROCARRAY_FLAGS_VACUUM': identificador nao declarado
[C:\dll\postgres\pg_visibility.vcxproj]

regards,
Ranier Vilela


Re: Improving connection scalability: GetSnapshotData()

2020-07-15 Thread Alvaro Herrera
On 2020-Jul-15, Andres Freund wrote:

> It could make sense to split the conversion of
> VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
> into is own commit. Not sure...

+1, the commit is large enough and that change can be had in advance.

Note you forward-declare struct GlobalVisState twice in heapam.h.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improving connection scalability: GetSnapshotData()

2020-07-01 Thread Daniel Gustafsson
This patch no longer applies to HEAD, please submit a rebased version.  I've
marked it Waiting on Author in the meantime.

cheers ./daniel




Re: Improving connection scalability: GetSnapshotData()

2020-06-07 Thread Michail Nikolaev
Hello, hackers.
Andres, nice work!

Sorry for the off-top.

Some of my work [1] related to the support of index hint bits on
standby is highly interfering with this patch.
Is it safe to consider it committed and start rebasing on top of the patches?

Thanks,
Michail.

[1]: 
https://www.postgresql.org/message-id/CANtu0ojmkN_6P7CQWsZ%3DuEgeFnSmpCiqCxyYaHnhYpTZHj7Ubw%40mail.gmail.com




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Michael Paquier
On Wed, Apr 08, 2020 at 03:17:41PM -0700, Andres Freund wrote:
> On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:
>> Lastly, with the ongoing world events, perhaps time that could have been
>> dedicated to this and other patches likely affected their completion. I
>> know most things in my life take way longer than they used to (e.g.
>> taking out the trash/recycles has gone from a 15s to 240s routine). The
>> same could be said about other patches as well, but this one has a far
>> greater impact (a double-edged sword, of course) given it's a feature
>> that everyone uses in PostgreSQL ;)
> 
> I'm obviously not alone in that, so I agree that it's not an argument
> pro/con anything.
> 
> But this definitely is the case for me. Leaving aside the general dread,
> not having a quiet home-office, nor good exercise, is definitely not
> helping.

Another factor to be careful of is that by committing a new feature in
a release cycle, you actually need to think about the extra amount of
resources you may need to address comments and issues about it in time
during the beta/stability period, and that more care is likely needed
if you commit something at the end of the cycle.  On top of that,
currently, it's a bit hard to plan one or two weeks ahead if help is
needed to stabilize something you worked on.  I am pretty sure that
we'll be able to sort things out with a collective effort though.
--
Michael




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Bruce Momjian
On Wed, Apr  8, 2020 at 03:25:34PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote:
> > If we don't commit this, where does this leave us with the
> > old_snapshot_threshold feature?  We remove it in back branches and have
> > no working version in PG 13?  That seems kind of bad.
> 
> I don't think this patch changes the situation for
> old_snapshot_threshold in a meaningful way.
> 
> Sure, this patch makes old_snapshot_threshold scale better, and triggers
> fewer unnecessary query cancellations.  But there still are wrong query
> results, the tests still don't test anything meaningful, and the
> determination of which query is cancelled is still wrong.

Oh, OK, so it still needs to be disabled.  I was hoping we could paint
this as a fix.

Based on Robert's analysis of the risk (no SQL syntax, no storage
changes), I think, if you are willing to keep working at this until the
final release, it is reasonable to apply it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Andres Freund
Hi,

On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote:
> If we don't commit this, where does this leave us with the
> old_snapshot_threshold feature?  We remove it in back branches and have
> no working version in PG 13?  That seems kind of bad.

I don't think this patch changes the situation for
old_snapshot_threshold in a meaningful way.

Sure, this patch makes old_snapshot_threshold scale better, and triggers
fewer unnecessary query cancellations.  But there still are wrong query
results, the tests still don't test anything meaningful, and the
determination of which query is cancelled is still wrong.

- Andres




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Andres Freund
Hi,

On 2020-04-08 09:44:16 -0400, Robert Haas wrote:
> Moreover, shakedown time will be minimal because we're so late in the
> release cycle

My impression increasingly is that there's very little actual shakedown
before beta :(. As evidenced e.g. by the fact that 2PC basically did not
work for several months, until I did new benchmarks for this patch.

I don't know what to do about that, but...

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Andres Freund
Hi,

On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:
> On 4/8/20 8:59 AM, Alexander Korotkov wrote:
> > On Wed, Apr 8, 2020 at 3:43 PM Andres Freund  wrote:
> >> Realistically it's still 2-3 hours of proof-reading.
> >>
> >> This makes me sad :(
> >
> > Can we ask RMT to extend feature freeze for this particular patchset?
> > I think it's reasonable assuming extreme importance of this patchset.

> One of the features of RMT responsibilities[1] is to be "hands off" as
> much as possible, so perhaps a reverse ask: how would people feel about
> this patch going into PG13, knowing that the commit would come after the
> feature freeze date?

I'm obviously biased, so I don't think there's much point in responding
directly to that question. But I thought it could be helpful if I
described what my thoughts about where the patchset is:

What made me not commit it "earlier" yesterday was not that I had/have
any substantial concerns about the technical details of the patch. But
that there were a few too many comments that didn't yet sound quite
right, that the commit messages didn't yet explain the architecture
/ benefits well enough, and that I noticed that a few variable names
were too easy to be misunderstood by others.

By 5 AM I had addressed most of that, except that some technical details
weren't yet mentioned in the commit messages ([1], they are documented
in the code). I also produce enough typos / odd grammar when fully
awake, so even though I did proofread my changes, I thought I needed
to do that again while awake.

There have been no substantial code changes since yesterday. The
variable renaming prompted by Robert (which I agree is an improvement),
as well as reducing the diff size by deferring some readability
improvements (probably also a good idea) did however produce quite a few
conflicts in subsequent patches that I needed to resolve. Another awake
read-through to confirm that I resolved them correctly seemed the
responsible thing to do before a commit.


> Lastly, with the ongoing world events, perhaps time that could have been
> dedicated to this and other patches likely affected their completion. I
> know most things in my life take way longer than they used to (e.g.
> taking out the trash/recycles has gone from a 15s to 240s routine). The
> same could be said about other patches as well, but this one has a far
> greater impact (a double-edged sword, of course) given it's a feature
> that everyone uses in PostgreSQL ;)

I'm obviously not alone in that, so I agree that it's not an argument
pro/con anything.

But this definitely is the case for me. Leaving aside the general dread,
not having a quiet home-office, nor good exercise, is definitely not
helping.

I'm really bummed that I didn't have the cycles to help get the shared
memory stats patch ready as well. It's clearly not yet there (but
improved a lot during the CF). But it's been around for so long, and
there's so many improvements blocked by the current stats
infrastructure...


[1] the "mirroring" of values beteween dense arrays and PGPROC, the
changed locking regimen for ProcArrayAdd/Remove, the widening of
lastCompletedXid to be a 64bit xid
[2] 
https://www.postgresql.org/message-id/20200407121503.zltbpqmdesurflnm%40alap3.anarazel.de

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Bruce Momjian
On Wed, Apr  8, 2020 at 09:44:16AM -0400, Robert Haas wrote:
> I don't know what the right thing to do is. I agree with everyone who
> says this is a very important problem, and I have the highest respect
> for Andres's technical ability. On the other hand, I have been around
> here long enough to know that deciding whether to allow late commits
> on the basis of how much we like the feature is a bad plan, because it
> takes into account only the upside of a commit, and ignores the
> possible downside risk. Typically, the commit is late because the
> feature was rushed to completion at the last minute, which can have an
> effect on quality. I can say, having read through the patches
> yesterday, that they don't suck, but I can't say that they're fully
> correct. That's not to say that we shouldn't decide to take them, but
> it is a concern to be taken seriously. We have made mistakes before in
> what we shipped that had serious implications for many users and for
> the project; we should all be wary of making more such mistakes. I am
> not trying to say that solving problems and making stuff better is NOT
> important, just that every coin has two sides.

If we don't commit this, where does this leave us with the
old_snapshot_threshold feature?  We remove it in back branches and have
no working version in PG 13?  That seems kind of bad.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Andres Freund
Hi,

On 2020-04-08 09:24:13 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 4:27 PM Andres Freund  wrote:
> > The main reason is that we want to be able to cheaply check the current
> > state of the variables (mostly when checking a backend's own state). We
> > can't access the "dense" ones without holding a lock, but we e.g. don't
> > want to make ProcArrayEndTransactionInternal() take a lock just to check
> > if vacuumFlags is set.
> >
> > It turns out to also be good for performance to have the copy for
> > another reason: The "dense" arrays share cachelines with other
> > backends. That's worth it because it allows to make GetSnapshotData(),
> > by far the most frequent operation, touch fewer cache lines. But it also
> > means that it's more likely that a backend's "dense" array entry isn't
> > in a local cpu cache (it'll be pulled out of there when modified in
> > another backend). In many cases we don't need the shared entry at commit
> > etc time though, we just need to check if it is set - and most of the
> > time it won't be.  The local entry allows to do that cheaply.
> >
> > Basically it makes sense to access the PGPROC variable when checking a
> > single backend's data, especially when we have to look at the PGPROC for
> > other reasons already.  It makes sense to look at the "dense" arrays if
> > we need to look at many / most entries, because we then benefit from the
> > reduced indirection and better cross-process cacheability.
> 
> That's a good explanation. I think it should be in the comments or a
> README somewhere.

I had a briefer version in the PROC_HDR comment. I've just expanded it
to:
 *
 * The denser separate arrays are beneficial for three main reasons: First, to
 * allow for as tight loops accessing the data as possible. Second, to prevent
 * updates of frequently changing data (e.g. xmin) from invalidating
 * cachelines also containing less frequently changing data (e.g. xid,
 * vacuumFlags). Third to condense frequently accessed data into as few
 * cachelines as possible.
 *
 * There are two main reasons to have the data mirrored between these dense
 * arrays and PGPROC. First, as explained above, a PGPROC's array entries can
 * only be accessed with either ProcArrayLock or XidGenLock held, whereas the
 * PGPROC entries do not require that (obviously there may still be locking
 * requirements around the individual field, separate from the concerns
 * here). That is particularly important for a backend efficiently checking
 * its own values, which it can often safely do without locking.  Second, the
 * PGPROC fields allow avoiding unnecessary accesses and modifications to the
 * dense arrays. A backend's own PGPROC is more likely to be in a local cache,
 * whereas the cachelines for the dense array will be modified by other
 * backends (often removing it from the cache for other cores/sockets). At
 * commit/abort time a check of the PGPROC value can avoid accessing/dirtying
 * the corresponding array value.
 *
 * Basically it makes sense to access the PGPROC variable when checking a
 * single backend's data, especially when already looking at the PGPROC for
 * other reasons.  It makes sense to look at the "dense" arrays if we
 * need to look at many / most entries, because we then benefit from the
 * reduced indirection and better cross-process cache-ability.
 *
 * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
 * in the dense arrays is initialized from the PGPROC while it already holds

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Robert Haas
On Wed, Apr 8, 2020 at 9:27 AM Jonathan S. Katz  wrote:
> One of the features of RMT responsibilities[1] is to be "hands off" as
> much as possible, so perhaps a reverse ask: how would people feel about
> this patch going into PG13, knowing that the commit would come after the
> feature freeze date?

Letting something be committed after feature freeze, or at any other
time, is just a risk vs. reward trade-off. Every patch carries some
chance of breaking stuff or making things worse. And every patch has a
chance of making something better that people care about.

On general principle, I would categorize this as a moderate-risk
patch. It doesn't change SQL syntax like, e.g. MERGE, nor does it
touch the on-disk format, like, e.g. INSERT .. ON CONFLICT UPDATE. The
changes are relatively localized, unlike, e.g. parallel query. Those
are all things that reduce risk. On the other hand, it's a brand new
patch which has not been thoroughly reviewed by anyone. Moreover,
shakedown time will be minimal because we're so late in the release
cycle. If it has subtle synchronization problems or if it regresses
performance badly in some cases, we might not find out about any of
that until after release. While in theory we could revert it any time,
since no SQL syntax or on-disk format is affected, in practice it will
be difficult to do that if it's making life better for some people and
worse for others.

I don't know what the right thing to do is. I agree with everyone who
says this is a very important problem, and I have the highest respect
for Andres's technical ability. On the other hand, I have been around
here long enough to know that deciding whether to allow late commits
on the basis of how much we like the feature is a bad plan, because it
takes into account only the upside of a commit, and ignores the
possible downside risk. Typically, the commit is late because the
feature was rushed to completion at the last minute, which can have an
effect on quality. I can say, having read through the patches
yesterday, that they don't suck, but I can't say that they're fully
correct. That's not to say that we shouldn't decide to take them, but
it is a concern to be taken seriously. We have made mistakes before in
what we shipped that had serious implications for many users and for
the project; we should all be wary of making more such mistakes. I am
not trying to say that solving problems and making stuff better is NOT
important, just that every coin has two sides.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Jonathan S. Katz
On 4/8/20 8:59 AM, Alexander Korotkov wrote:
> On Wed, Apr 8, 2020 at 3:43 PM Andres Freund  wrote:
>> Realistically it's still 2-3 hours of proof-reading.
>>
>> This makes me sad :(
> 
> Can we ask RMT to extend feature freeze for this particular patchset?
> I think it's reasonable assuming extreme importance of this patchset.

One of the features of RMT responsibilities[1] is to be "hands off" as
much as possible, so perhaps a reverse ask: how would people feel about
this patch going into PG13, knowing that the commit would come after the
feature freeze date?

My 2¢, with RMT hat off:

As mentioned earlier[2], we know that connection scalability is a major
pain point with PostgreSQL and any effort that can help alleviate that
is a huge win, even in incremental gains. Andres et al experimentation
show that this is more than incremental gains, and will certainly make a
huge difference in people's PostgreSQL experience. It is one of those
features where you can "plug in and win" -- you get a performance
benefit just by upgrading. That is not insignificant.

However, I also want to ensure that we are fair: in the past there have
also been other patches that have been "oh-so-close" to commit before
feature freeze but have not made it in (an example escapes me at the
moment). Therefore, we really need to have consensus among ourselves
that the right decision is to allow this to go in after feature freeze.

Did this come in (very) late into the development cycle? Yes, and I
think normally that's enough to give cause for pause. But I could also
argue that Andres is fixing a "bug" in PostgreSQL (probably several
bugs ;) -- and perhaps the fixes can't be backpatched
per se, but they do improve the overall stability and usability of
PostgreSQL and it'd be a shame if we have to wait on them.

Lastly, with the ongoing world events, perhaps time that could have been
dedicated to this and other patches likely affected their completion. I
know most things in my life take way longer than they used to (e.g.
taking out the trash/recycles has gone from a 15s to 240s routine). The
same could be said about other patches as well, but this one has a far
greater impact (a double-edged sword, of course) given it's a feature
that everyone uses in PostgreSQL ;)

So with my RMT hat off, I say +1 to allowing this post feature freeze,
though within a reasonable window.

My 2¢, with RMT hat on:

I believe in[2] I outlined a way a path for including the patch even at
this stage in the game. If it is indeed committed, I think we
immediately put it on the "Recheck a mid-Beta" list. I know it's not as
trivial to change as something like "Determine if jit="on" by default"
(not picking on Andres, I just remember that example from RMT 11), but
it at least provides a notable reminder that we need to ensure we test
this thoroughly, and point people to really hammer it during beta.

So with my RMT hat on, I say +0 but with a ;)

Thanks,

Jonathan

[1] https://wiki.postgresql.org/wiki/Release_Management_Team#History
[2]
https://www.postgresql.org/message-id/6be8c321-68ea-a865-d8d0-50a3af616463%40postgresql.org





Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Robert Haas
On Tue, Apr 7, 2020 at 4:27 PM Andres Freund  wrote:
> The main reason is that we want to be able to cheaply check the current
> state of the variables (mostly when checking a backend's own state). We
> can't access the "dense" ones without holding a lock, but we e.g. don't
> want to make ProcArrayEndTransactionInternal() take a lock just to check
> if vacuumFlags is set.
>
> It turns out to also be good for performance to have the copy for
> another reason: The "dense" arrays share cachelines with other
> backends. That's worth it because it allows to make GetSnapshotData(),
> by far the most frequent operation, touch fewer cache lines. But it also
> means that it's more likely that a backend's "dense" array entry isn't
> in a local cpu cache (it'll be pulled out of there when modified in
> another backend). In many cases we don't need the shared entry at commit
> etc time though, we just need to check if it is set - and most of the
> time it won't be.  The local entry allows to do that cheaply.
>
> Basically it makes sense to access the PGPROC variable when checking a
> single backend's data, especially when we have to look at the PGPROC for
> other reasons already.  It makes sense to look at the "dense" arrays if
> we need to look at many / most entries, because we then benefit from the
> reduced indirection and better cross-process cacheability.

That's a good explanation. I think it should be in the comments or a
README somewhere.

> How about:
> /*
>  * If the current xactCompletionCount is still the same as it was at 
> the
>  * time the snapshot was built, we can be sure that rebuilding the
>  * contents of the snapshot the hard way would result in the same 
> snapshot
>  * contents:
>  *
>  * As explained in transam/README, the set of xids considered running 
> by
>  * GetSnapshotData() cannot change while ProcArrayLock is held. 
> Snapshot
>  * contents only depend on transactions with xids and 
> xactCompletionCount
>  * is incremented whenever a transaction with an xid finishes (while
>  * holding ProcArrayLock) exclusively). Thus the xactCompletionCount 
> check
>  * ensures we would detect if the snapshot would have changed.
>  *
>  * As the snapshot contents are the same as it was before, it is is 
> safe
>  * to re-enter the snapshot's xmin into the PGPROC array. None of the 
> rows
>  * visible under the snapshot could already have been removed (that'd
>  * require the set of running transactions to change) and it fulfills 
> the
>  * requirement that concurrent GetSnapshotData() calls yield the same
>  * xmin.
>  */

That's nice and clear.
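
For concreteness, the fast path that comment describes amounts to roughly the
following sketch. It assumes the snapshot records the xactCompletionCount
observed when it was built (called snapXactCompletionCount here), that xmin
lives directly in PGPROC as discussed for 0006, and that the helper name is
purely illustrative; this is not the committed code.

static bool
GetSnapshotDataReuse(Snapshot snapshot)
{
    /* caller holds ProcArrayLock in shared mode */
    if (snapshot->snapXactCompletionCount !=
        ShmemVariableCache->xactCompletionCount)
        return false;           /* an xid-bearing xact completed; rebuild */

    /*
     * Snapshot contents are unchanged, so re-entering its xmin as this
     * backend's xmin is safe: no row version visible to the snapshot can
     * have been removed in the meantime.
     */
    if (!TransactionIdIsValid(MyProc->xmin))
        MyProc->xmin = TransactionXmin = snapshot->xmin;

    return true;
}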

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-08 Thread Alexander Korotkov
On Wed, Apr 8, 2020 at 3:43 PM Andres Freund  wrote:
> Realistically it's still 2-3 hours of proof-reading.
>
> This makes me sad :(

Can we ask RMT to extend feature freeze for this particular patchset?
I think it's reasonable assuming extreme importance of this patchset.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 16:13:07 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 3:24 PM Andres Freund  wrote:
> > > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> > >
> > > Apparently this array is not dense in the sense that it excludes
> > > unused slots, but comments elsewhere don't seem to entirely agree.
> >
> > What do you mean with "unused slots"? Backends that committed?
> 
> Backends that have no XID. You mean, I guess, that it is "dense" in
> the sense that only live backends are in there, not "dense" in the
> sense that only active write transactions are in there.

Correct.

I tried the "only active write transaction" approach, btw, and had a
hard time making it scale well (due to the much more frequent moving of
entries at commit/abort time).  If we were to go to an 'only active
transactions' array at some point we'd imo still need pretty much all
the other changes made here - so I'm not worried about it for now.


> > > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + 
> > > max_prepared_xacts;
> > >
> > >   /* ProcGlobal */
> > >   size = add_size(size, sizeof(PROC_HDR));
> > > - /* MyProcs, including autovacuum workers and launcher */
> > > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > > - /* AuxiliaryProcs */
> > > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > > - /* Prepared xacts */
> > > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > > - /* ProcStructLock */
> > > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> > >
> > > This seems like a bad idea. If we establish a precedent that it's OK
> > > to have sizing routines that don't use add_size() and mul_size(),
> > > people are going to cargo cult that into places where there is more
> > > risk of overflow than there is here.
> >
> > Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> > would overflow due to too big MaxBackends or max_prepared_xacts? The
> > multiplication itself is still protected by add_size(). It didn't seem
> > correct to use add_size for the TotalProcs addition, since that's not
> > really a size. And since the limit for procs is much lower than
> > UINT32_MAX...
> 
> I'm concerned that there are 0 uses of add_size in any shared-memory
> sizing function, and I think it's best to keep it that way.

I can't make sense of that sentence?


We already have code like this, and have for a long time:
/* Size of the ProcArray structure itself */
#define PROCARRAY_MAXPROCS  (MaxBackends + max_prepared_xacts)

adding NUM_AUXILIARY_PROCS doesn't really change that, does it?


> If you initialize TotalProcs = add_size(MaxBackends,
> add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy.

Will do.
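
For illustration, the overflow-checked form being agreed on here would look
roughly like the sketch below; add_size()/mul_size() raise an error on
overflow instead of silently wrapping, and the declaration as Size is an
assumption of this sketch.

    Size size = 0;
    Size TotalProcs = add_size(MaxBackends,
                               add_size(NUM_AUXILIARY_PROCS,
                                        max_prepared_xacts));

    /* ProcGlobal header, then one PGPROC per backend/aux process/prepared xact */
    size = add_size(size, sizeof(PROC_HDR));
    size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));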


Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 15:26:36 -0400, Robert Haas wrote:
> 0008 -
> 
> Here again, I greatly dislike putting Copy in the name. It makes
> little sense to pretend that either is the original and the other is
> the copy. You just have the same data in two places. If one of them is
> more authoritative, the place to explain that is in the comments, not
> by elongating the structure member name and supposing anyone will be
> able to make something of that.

Ok.



> 0009, 0010 -
> 
> I think you've got this whole series of things divided up too finely.
> Like, 0005 feels like the meat of it, and that has a bunch of things
> in it that could plausible be separated out as separate commits. 0007
> also seems to do more than one kind of thing (see my comment regarding
> moving some of that into 0006). But whacking everything around like a
> crazy man in 0005 and a little more in 0007 and then doing the
> following cleanup in these little tiny steps seems pretty lame.
> Separating 0009 from 0010 is maybe the clearest example of that, but
> IMHO it's pretty unclear why both of these shouldn't be merged with
> 0008.

I found it a *lot* easier to review / evolve them this way. I e.g. had
an earlier version in which the subxid part of the change worked
substantially differently (it tried to elide the overflowed bool, by
defining -1 as the indicator for overflows), and it'd have been way harder
to change that if I didn't have a patch with *just* the subxid changes.

I'd not push them separated by time, but I do think it'd make sense to
push them as separate commits. I think it's easier to review them in
case of a bug in a separate area.


> My comments on the Copy naming apply here as well. I am also starting
> to wonder why exactly we need two copies of all this stuff. Perhaps
> I've just failed to absorb the idea for having read the patch too
> briefly, but I think that we need to make sure that it's super-clear
> why we're doing that. If we just needed it for one field because
> $REASONS, that would be one thing, but if we need it for all of them
> then there must be some underlying principle here that needs a good
> explanation in an easy-to-find and centrally located place.

The main reason is that we want to be able to cheaply check the current
state of the variables (mostly when checking a backend's own state). We
can't access the "dense" ones without holding a lock, but we e.g. don't
want to make ProcArrayEndTransactionInternal() take a lock just to check
if vacuumFlags is set.

It turns out to also be good for performance to have the copy for
another reason: The "dense" arrays share cachelines with other
backends. That's worth it because it allows to make GetSnapshotData(),
by far the most frequent operation, touch fewer cache lines. But it also
means that it's more likely that a backend's "dense" array entry isn't
in a local cpu cache (it'll be pulled out of there when modified in
another backend). In many cases we don't need the shared entry at commit
etc time though, we just need to check if it is set - and most of the
time it won't be.  The local entry allows to do that cheaply.

Basically it makes sense to access the PGPROC variable when checking a
single backend's data, especially when we have to look at the PGPROC for
other reasons already.  It makes sense to look at the "dense" arrays if
we need to look at many / most entries, because we then benefit from the
reduced indirection and better cross-process cacheability.
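
To make that concrete, here is a stripped-down sketch of the two locations and
the invariant tying them together. The real structs carry many more members;
the pgxactoff indexing and field names follow this thread and are not
necessarily what ends up committed.

typedef struct PGPROC
{
    TransactionId xid;          /* mirror of ProcGlobal->xids[pgxactoff] */
    uint8         vacuumFlags;  /* mirror of ProcGlobal->vacuumFlags[pgxactoff] */
    int           pgxactoff;    /* offset into the dense arrays; may change
                                 * whenever ProcArrayRemove() compacts them */
    /* ... many more fields ... */
} PGPROC;

typedef struct PROC_HDR
{
    PGPROC        *allProcs;    /* sparse: all PGPROC structures */
    TransactionId *xids;        /* dense: first numProcs entries are live */
    uint8         *vacuumFlags; /* dense, parallel to xids */
    /* ... */
} PROC_HDR;

/*
 * Both copies are updated together, under the same locks, so
 *     MyProc->xid == ProcGlobal->xids[MyProc->pgxactoff]
 * always holds.  GetSnapshotData() scans the dense xids[] array under
 * ProcArrayLock; a backend checking only its own state reads MyProc->xid,
 * avoiding the shared, frequently dirtied cachelines.
 */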


> 0011 -
> 
> + * Number of top-level transactions that completed in some form since the
> + * start of the server. This currently is solely used to check whether
> + * GetSnapshotData() needs to recompute the contents of the snapshot, or
> + * not. There are likely other users of this.  Always above 1.
> 
> Does it only count XID-bearing transactions? If so, best mention that.

Oh, good point.


> +GetSnapshotDataFillTooOld(Snapshot snapshot)
> 
> Uh... no clue what's going on here. Granted the code had no comments
> in the old place either, so I guess it's not worse, but even the name
> of the new function is pretty incomprehensible.

It fills the old_snapshot_threshold fields of a Snapshot.


> + * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go
> 
> I know what it means to re-enter a building, but I don't know what it
> means to re-enter the snapshot's xmin.

Re-entering it into the procarray, thereby preventing rows that the
snapshot could see from being removed.

> This whole comment seems a bit murky.

How about:
/*
 * If the current xactCompletionCount is still the same as it was at the
 * time the snapshot was built, we can be sure that rebuilding the
 * contents of the snapshot the hard way would result in the same 
snapshot
 * contents:
 *
 * As explained in transam/README, the set of xids considered running by
 * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
 * 

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 3:24 PM Andres Freund  wrote:
> > 0007 -
> >
> > + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> > +ProcGlobal->xids[] entry. */
> >
> > Can we please NOT put Copy into the name like that? Pretty please?
>
> Do you have a suggested naming scheme? Something indicating that it's
> not the only place that needs to be updated?

I don't think trying to indicate that in the structure member names is
a useful idea. I think you should give them the same names, maybe with
an "s" to pluralize the copy hanging off of ProcGlobal, and put a
comment that says something like:

We keep two copies of each of the following three fields. One copy is
here in the PGPROC, and the other is in a more densely-packed array
hanging off of PGXACT. Both copies of the value must always be updated
at the same time and under the same locks, so that it is always the
case that MyProc->xid == ProcGlobal->xids[MyProc->pgprocno] and
similarly for vacuumFlags and WHATEVER. Note, however, that the arrays
attached to ProcGlobal only contain entries for PGPROC structures that
are currently part of the ProcArray (i.e. there is currently a backend
for that PGPROC). We use those arrays when STUFF and the copies in the
individual PGPROC when THINGS.

> I think it's more on-point here, because we need to hold either of the
> locks* even for changes to a backend's own status that one reasonably
> could expect would be safe to at least inspect.

It's just too brief and obscure to be useful.

> > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> >
> > Apparently this array is not dense in the sense that it excludes
> > unused slots, but comments elsewhere don't seem to entirely agree.
>
> What do you mean with "unused slots"? Backends that committed?

Backends that have no XID. You mean, I guess, that it is "dense" in
the sense that only live backends are in there, not "dense" in the
sense that only active write transactions are in there. It would be
nice to nail that down better; the wording I suggested above might
help.

> > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + 
> > max_prepared_xacts;
> >
> >   /* ProcGlobal */
> >   size = add_size(size, sizeof(PROC_HDR));
> > - /* MyProcs, including autovacuum workers and launcher */
> > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > - /* AuxiliaryProcs */
> > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > - /* Prepared xacts */
> > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > - /* ProcStructLock */
> > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> >
> > This seems like a bad idea. If we establish a precedent that it's OK
> > to have sizing routines that don't use add_size() and mul_size(),
> > people are going to cargo cult that into places where there is more
> > risk of overflow than there is here.
>
> Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> would overflow due to too big MaxBackends or max_prepared_xacts? The
> multiplication itself is still protected by add_size(). It didn't seem
> correct to use add_size for the TotalProcs addition, since that's not
> really a size. And since the limit for procs is much lower than
> UINT32_MAX...

I'm concerned that there are 0 uses of add_size in any shared-memory
sizing function, and I think it's best to keep it that way. If you
initialize TotalProcs = add_size(MaxBackends,
add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy. I
think it's a desperately bad idea to imagine that we can dispense with
overflow checks here and be safe. It's just too easy for that to
become false due to future code changes, or get copied to other places
where it's unsafe now.

> > You've got a bunch of different places that talk about the new PGXACT
> > array and they are somewhat redundant yet without saying exactly the
> > same thing every time either. I think that needs cleanup.
>
> Could you point out a few of those comments, I'm not entirely sure which
> you're talking about?

+ /*
+ * Also allocate a separate arrays for data that is frequently (e.g. by
+ * GetSnapshotData()) accessed from outside a backend.  There is one entry
+ * in each for every *live* PGPROC entry, and they are densely packed so
+ * that the first procArray->numProc entries are all valid.  The entries
+ * for a PGPROC in those arrays are at PGPROC->pgxactoff.
+ *
+ * Note that they may not be accessed without ProcArrayLock held! Upon
+ * ProcArrayRemove() later entries will be moved.
+ *
+ * These are separate from the main PGPROC array so that the most heavily
+ * accessed data is stored contiguously in memory in as few cache lines as
+ * possible. This provides significant performance benefits, especially on
+ * a multiprocessor system.
+ */

+ * Arrays with per-backend information that is hotly accessed, indexed by
+ * PGPROC->pgxactoff. These are in separate arrays for three reasons:
+ * First, to 

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 3:31 PM Andres Freund  wrote:
> Well, it *is* only a vague test :). It shouldn't ever have a false
> positive, but there's plenty of chance for false negatives (if wrapped
> around far enough).

Sure, but I think you get my point. Asserting that something "might
be" true isn't much of an assertion. Saying that it's in the correct
range is not to say there can't be a problem - but we're saying that
it IS in the expect range, not that it may or may not be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 15:03:46 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 1:51 PM Andres Freund  wrote:
> > > ComputedHorizons seems like a fairly generic name. I think there's
> > > some relationship between InvisibleToEveryoneState and
> > > ComputedHorizons that should be brought out more clearly by the naming
> > > and the comments.
> >
> > I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> > much... But I find it hard to come up with something that's meaningfully
> > better.
> 
> It would help to stick XID in there, like ComputedXIDHorizons. What I
> find really baffling is that you seem to have two structures in the
> same file that have essentially the same purpose, but the second one
> (ComputedHorizons) has a lot more stuff in it. I can't understand why.

ComputedHorizons are the various "accurate" horizons computed by
ComputeTransactionHorizons(). That's used to determine a horizon for
vacuuming (via GetOldestVisibleTransactionId()) and other similar use
cases.

The various InvisibleToEveryoneState variables contain the boundary
based horizons, and are updated / initially filled by
GetSnapshotData(). When a tested value falls between the boundaries,
we update the approximate boundaries using
ComputeTransactionHorizons(). That briefly makes the boundaries in
the InvisibleToEveryoneState accurate - but future GetSnapshotData()
calls will increase the definitely_needed_bound (if transactions
committed since).
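
In code form, the test described above is roughly the sketch below. The
maybe_needed/definitely_needed names follow the patch discussion,
RecomputeHorizons() is a stand-in for the ComputeTransactionHorizons() path,
and the patch actually uses 64-bit FullTransactionIds here to stay
wraparound-safe.

/*
 * definitely_needed is advanced cheaply by GetSnapshotData(); maybe_needed
 * only becomes accurate when ComputeTransactionHorizons() scans the
 * procarray under ProcArrayLock.
 */
if (TransactionIdPrecedes(xid, state->maybe_needed))
    return true;                /* certainly not needed by any snapshot */
else if (TransactionIdFollowsOrEquals(xid, state->definitely_needed))
    return false;               /* may still be visible to some snapshot */
else
{
    /* in between: recompute accurate horizons (unless that just happened) */
    RecomputeHorizons(state);
    return TransactionIdPrecedes(xid, state->maybe_needed);
}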

The ComputedHorizons fields could instead just be pointer based
arguments to ComputeTransactionHorizons(), but that seems clearly
worse.

I'll change ComputedHorizons's comment to say that it's the result of
ComputeTransactionHorizons(), not the "state".


> > What's the inconsistency? The dropped replication_ vs dropped FullXid
> > postfix?
> 
> Yeah, just having the member names be randomly different between the
> structs. Really harms greppability.

The long names make it hard to keep line lengths in control, in
particular when also involving the long macro names for TransactionId /
FullTransactionId comparators...


> > > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > > limited_oldest_committed naming is confusing, but the comments make it
> > > a lot clearer.
> >
> > I didn't like _committed much either. But couldn't come up with
> > something short. _relied_upon?
> 
> oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

Will go for old_snapshot_limit_used, and rename the other variables to
old_snapshot_limit_ts, old_snapshot_limit_xmin.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 14:51:52 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 2:28 PM Andres Freund  wrote:
> > Does that make some sense? Do you have a better suggestion for a name?
> 
> I think it makes sense. I have two basic problems with the name. The
> first is that "on disk" doesn't seem to be a very clear way of
> describing what you're actually checking here, and it definitely
> doesn't refer to an existing concept which sophisticated hackers can
> be expected to understand. The second is that "may" is ambiguous in
> English: it can either mean that something is permissible ("Johnny,
> you may go to the bathroom") or that we do not have certain knowledge
> of it ("Johnny may be in the bathroom"). When it is followed by "be",
> it usually has the latter sense, although there are exceptions (e.g.
> "She may be discharged from the hospital today if she wishes, but we
> recommend that she stay for another day"). Consequently, I found that
> use of "may be" in this context wicked confusing.

Well, it *is* only a vague test :). It shouldn't ever have a false
positive, but there's plenty of chance for false negatives (if wrapped
around far enough).


> So I suggest a name with "Is" or no verb, rather than one with
> "MayBe." And I suggest something else instead of "OnDisk," e.g.
> AssertTransactionIdIsInUsableRange() or
> TransactionIdIsInAllowableRange() or
> AssertTransactionIdWraparoundProtected(). I kind of like that last
> one, but YMMV.

Makes sense - but they all seem to express a bit more certainty than I
think the test actually provides.

I explicitly did not want (and added a comment to that affect) have
something like TransactionIdIsInAllowableRange(), because there never
can be a safe use of its return value, as far as I can tell.

The "OnDisk" was intended to clarify that the range it verifies is
whether it'd be ok for the xid to have been found in a relation.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
0008 -

Here again, I greatly dislike putting Copy in the name. It makes
little sense to pretend that either is the original and the other is
the copy. You just have the same data in two places. If one of them is
more authoritative, the place to explain that is in the comments, not
by elongating the structure member name and supposing anyone will be
able to make something of that.

+ *
+ * XXX: That's why this is using vacuumFlagsCopy.

I am not sure there's any problem with the code that needs fixing
here, so I might think about getting rid of this XXX. But this gets
back to my complaint about the locking regime being unclear. What I
think you need to do here is rephrase the previous paragraph so that
it explains the reason for using this copy a bit better. Like "We read
the copy of vacuumFlags from PGPROC rather than visiting the copy
attached to ProcGlobal because we can do that without taking a lock.
See fuller explanation in ." Or whatever.

0009, 0010 -

I think you've got this whole series of things divided up too finely.
Like, 0005 feels like the meat of it, and that has a bunch of things
in it that could plausible be separated out as separate commits. 0007
also seems to do more than one kind of thing (see my comment regarding
moving some of that into 0006). But whacking everything around like a
crazy man in 0005 and a little more in 0007 and then doing the
following cleanup in these little tiny steps seems pretty lame.
Separating 0009 from 0010 is maybe the clearest example of that, but
IMHO it's pretty unclear why both of these shouldn't be merged with
0008.

To be clear, I exaggerate for effect. 0005 is not whacking everything
around like a crazy man. But it is a non-minimal patch, whereas I
consider 0009 and 0010 to be sub-minimal.

My comments on the Copy naming apply here as well. I am also starting
to wonder why exactly we need two copies of all this stuff. Perhaps
I've just failed to absorb the idea for having read the patch too
briefly, but I think that we need to make sure that it's super-clear
why we're doing that. If we just needed it for one field because
$REASONS, that would be one thing, but if we need it for all of them
then there must be some underlying principle here that needs a good
explanation in an easy-to-find and centrally located place.

0011 -

+ * Number of top-level transactions that completed in some form since the
+ * start of the server. This currently is solely used to check whether
+ * GetSnapshotData() needs to recompute the contents of the snapshot, or
+ * not. There are likely other users of this.  Always above 1.

Does it only count XID-bearing transactions? If so, best mention that.

+ * transactions completed since the last GetSnapshotData()..

Too many periods.

+ /* Same with CSN */
+ ShmemVariableCache->xactCompletionCount++;

If I didn't know that CSN stood for commit sequence number from
reading years of mailing list traffic, I'd be lost here. So I think
this comment shouldn't use that term.

+GetSnapshotDataFillTooOld(Snapshot snapshot)

Uh... no clue what's going on here. Granted the code had no comments
in the old place either, so I guess it's not worse, but even the name
of the new function is pretty incomprehensible.

+ * Helper function for GetSnapshotData() that check if the bulk of the

checks

+ * the fields that need to change and returns true. false is returned
+ * otherwise.

Otherwise, it returns false.

+ * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go

I know what it means to re-enter a building, but I don't know what it
means to re-enter the snapshot's xmin.

This whole comment seems a bit murky.

...Robert




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 14:28:09 -0400, Robert Haas wrote:
> More review, since it sounds like you like it:
>
> 0006 - Boring. But I'd probably make this move both xmin and xid back,
> with related comment changes; see also next comment.
>
> 0007 -
>
> + TransactionId xidCopy; /* this backend's xid, a copy of this proc's
> +ProcGlobal->xids[] entry. */
>
> Can we please NOT put Copy into the name like that? Pretty please?

Do you have a suggested naming scheme? Something indicating that it's
not the only place that needs to be updated?


> + int pgxactoff; /* offset into various ProcGlobal-> arrays
> + * NB: can change any time unless locks held!
> + */
>
> I'm going to add the helpful comment "NB: can change any time unless
> locks held!" to every data structure in the backend that is in shared
> memory and not immutable. No need, of course, to mention WHICH
> locks...

I think it's more on-point here, because we need to hold either of the
locks* even for changes to a backend's own status that one reasonably
could expect would be safe to at least inspect. E.g looking at
ProcGlobal->xids[MyProc->pgxactoff]
doesn't look suspicious, but could very well return another backends
xid, if neither ProcArrayLock nor XidGenLock is held (because a
ProcArrayRemove() could have changed pgxactoff if a previous entry was
removed).

*see comment at PROC_HDR:

 *
 * Adding/Removing an entry into the procarray requires holding *both*
 * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are
 * needed because the dense arrays (see below) are accessed from
 * GetNewTransactionId() and GetSnapshotData(), and we don't want to add
 * further contention by both using one lock. Adding/Removing a procarray
 * entry is much less frequent.
 */
typedef struct PROC_HDR
{
/* Array of PGPROC structures (not including dummies for prepared txns) 
*/
PGPROC *allProcs;


/*
 * Arrays with per-backend information that is hotly accessed, indexed 
by
 * PGPROC->pgxactoff. These are in separate arrays for three reasons:
 * First, to allow for as tight loops accessing the data as
 * possible. Second, to prevent updates of frequently changing data from
 * invalidating cachelines shared with less frequently changing
 * data. Third to condense frequently accessed data into as few 
cachelines
 * as possible.
 *
 * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), 
those
 * copies are used to provide the contents of the dense data, and will 
be
 * transferred by ProcArrayAdd() while it already holds ProcArrayLock.
 */

there's also

 * The various *Copy fields are copies of the data in ProcGlobal arrays that
 * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR
 * comments).


I had a more explicit warning/explanation about the dangers of accessing
the arrays without locks, but apparently went too far when reducing
duplicated comments.
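
A two-line illustration of that hazard (assuming the mirrored field simply
ends up named xid, per the naming discussion elsewhere in the thread):

/*
 * Unsafe without ProcArrayLock or XidGenLock held: pgxactoff can be changed
 * concurrently by ProcArrayRemove() compacting the dense arrays, so this may
 * read some other backend's entry:
 */
xid = ProcGlobal->xids[MyProc->pgxactoff];

/* Safe without those locks: the backend's own mirrored copy in PGPROC. */
xid = MyProc->xid;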


> On a related note, PROC_HDR really, really, really needs comments
> explaining the locking regimen for the new xids field.


I'll expand the above, in particular highlighting the danger of
pgxactoff changing.


> + ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
>
> Apparently this array is not dense in the sense that it excludes
> unused slots, but comments elsewhere don't seem to entirely agree.

What do you mean with "unused slots"? Backends that committed?

Dense is intended to describe that the arrays only contain currently
"live" entries. I.e. that the first procArray->numProcs entries in each
array have the data for all procs (including prepared xacts) that are
"connected".  This is extending the concept that already existed for
procArray->pgprocnos.

Wheras the PGPROC/PGXACT arrays have "unused" entries interspersed.

This is what previously lead to the slow loop in GetSnapshotData(),
where we had to iterate over PGXACTs over an indirection in
procArray->pgprocnos. I.e. to only look at in-use PGXACTs we had to go
through allProcs[arrayP->pgprocnos[i]], which is, uh, suboptimal for
a performance critical routine holding a central lock.
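
A sketch of what that means for the GetSnapshotData() hot loop (heavily
simplified; variable names are approximate and the xmin/xip bookkeeping is
omitted):

/* Before: one indirection per live backend, landing in a sparse PGXACT. */
for (int i = 0; i < arrayP->numProcs; i++)
{
    int     pgprocno = arrayP->pgprocnos[i];
    PGXACT *pgxact = &allPgXact[pgprocno];

    TransactionId xid = pgxact->xid;
    /* ... snapshot bookkeeping ... */
}

/*
 * After: the first numProcs entries of the dense array are exactly the live
 * ones, so the hot loop reads a contiguous TransactionId array instead.
 */
TransactionId *xids = ProcGlobal->xids;

for (int i = 0; i < arrayP->numProcs; i++)
{
    TransactionId xid = xids[i];
    /* ... snapshot bookkeeping ... */
}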

I'll try to expand the comments around "dense", but let me know if you
have a better descriptor.


> Maybe the comments discussing how it is "dense" need to be a little
> more precise about this.
>
> + for (int i = 0; i < nxids; i++)
>
> I miss my C89. Yeah, it's just me.

Oh, dear god. I hate declaring variables like 'i' on function scope. The
bug that haunted me the longest in the development of this patch was in
XidCacheRemoveRunningXids, where there are both i and j, and a macro
XidCacheRemove(i), but the macro gets passed j as i.


> - if (!suboverflowed)
> + if (suboverflowed)
> + continue;
> +
>
> Do we really need to do this kind of diddling in this patch? I mean
> yes to the idea, but no to things that are going to make it harder to
> understand what happened if this blows up.

I can try to reduce 

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 1:51 PM Andres Freund  wrote:
> > ComputedHorizons seems like a fairly generic name. I think there's
> > some relationship between InvisibleToEveryoneState and
> > ComputedHorizons that should be brought out more clearly by the naming
> > and the comments.
>
> I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
> much... But I find it hard to come up with something that's meaningfully
> better.

It would help to stick XID in there, like ComputedXIDHorizons. What I
find really baffling is that you seem to have two structures in the
same file that have essentially the same purpose, but the second one
(ComputedHorizons) has a lot more stuff in it. I can't understand why.

> What's the inconsistency? The dropped replication_ vs dropped FullXid
> postfix?

Yeah, just having the member names be randomly different between the
structs. Really harms greppability.

> > Generally, heap_prune_satisfies_vacuum looks pretty good. The
> > limited_oldest_committed naming is confusing, but the comments make it
> > a lot clearer.
>
> I didn't like _committed much either. But couldn't come up with
> something short. _relied_upon?

oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed?

> It's just adjusting for the changed name of latestCompletedXid to
> latestCompletedFullXid, as part of widening it to 64bits.  I'm not
> really a fan of adding that to the variable name, but surrounding code
> already did it (cf VariableCache->nextFullXid), so I thought I'd follow
> suit.

Oops, that was me misreading the diff. Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
On Tue, Apr 7, 2020 at 2:28 PM Andres Freund  wrote:
> Does that make some sense? Do you have a better suggestion for a name?

I think it makes sense. I have two basic problems with the name. The
first is that "on disk" doesn't seem to be a very clear way of
describing what you're actually checking here, and it definitely
doesn't refer to an existing concept which sophisticated hackers can
be expected to understand. The second is that "may" is ambiguous in
English: it can either mean that something is permissible ("Johnny,
you may go to the bathroom") or that we do not have certain knowledge
of it ("Johnny may be in the bathroom"). When it is followed by "be",
it usually has the latter sense, although there are exceptions (e.g.
"She may be discharged from the hospital today if she wishes, but we
recommend that she stay for another day"). Consequently, I found that
use of "may be" in this context wicked confusing. What came to mind
was:

bool
RobertMayBeAGiraffe(void)
{
return true; // i mean, i haven't seen him since last week, so who knows?
}

So I suggest a name with "Is" or no verb, rather than one with
"MayBe." And I suggest something else instead of "OnDisk," e.g.
AssertTransactionIdIsInUsableRange() or
TransactionIdIsInAllowableRange() or
AssertTransactionIdWraparoundProtected(). I kind of like that last
one, but YMMV.

I wish to clarify that in sending these review emails I am taking no
position on whether or not it is prudent to commit any or all of them.
I do not think we can rule out the possibility that they will Break
Things, but neither do I wish to be seen as That Guy Who Stands In The
Way of Important Improvements. Time does not permit me a detailed
review anyway. So, these comments are provided in the hope that they
may be useful but without endorsement or acrimony. If other people
want to endorse or, uh, acrimoniate, based on my comments, that is up
to them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Peter Geoghegan
On Tue, Apr 7, 2020 at 11:28 AM Andres Freund  wrote:
> There is a lot of code that is pretty unsafe around wraparounds... They
> are getting easier and easier to hit on a regular schedule in production
> (plenty of databases that hit wraparounds multiple times a week). And I
> don't think we as PG developers often quite take that into
> account.

It would be nice if there was high level documentation on wraparound
hazards. Maybe even a dedicated README file.

-- 
Peter Geoghegan




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

On 2020-04-07 10:51:12 -0700, Andres Freund wrote:
> > +void AssertTransactionIdMayBeOnDisk(TransactionId xid)
> > 
> > Formatting.
> > 
> > + * Assert that xid is one that we could actually see on disk.
> > 
> > I don't know what this means. The whole purpose of this routine is
> > very unclear to me.
> 
> It's intended to be a double check against

forgetting things...? Err:

It is intended to make it easier to detect cases where the passed
TransactionId is not safe against wraparound. If there is protection
against wraparound, then the xid

a) may never be older than ShmemVariableCache->oldestXid (since
   otherwise the rel/datfrozenxid could not have advanced past the xid,
   and because oldestXid is what prevents ->nextFullXid from
   advancing far enough to cause a wraparound)

b) cannot be >= ShmemVariableCache->nextFullXid. If it is, it cannot
   recently have come from GetNewTransactionId(), and thus there is no
   anti-wraparound protection either.

As full wraparounds are painful to exercise in testing,
AssertTransactionIdMayBeOnDisk() is intended to make it easier to detect
potential hazards.
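
In code form, a) and b) amount to roughly this sketch. The real assertion
also has to deal with permanent/special xids and with reading oldestXid and
nextFullXid consistently, which is glossed over here.

static void
AssertTransactionIdMayBeOnDisk(TransactionId xid)
{
    if (!TransactionIdIsNormal(xid))
        return;                 /* Invalid/Bootstrap/Frozen are always fine */

    /* a) wraparound protection implies oldestXid cannot have passed xid */
    Assert(TransactionIdFollowsOrEquals(xid, ShmemVariableCache->oldestXid));

    /* b) xid must have been handed out by GetNewTransactionId() */
    Assert(TransactionIdPrecedes(xid,
               XidFromFullTransactionId(ShmemVariableCache->nextFullXid)));
}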

The reason for the *OnDisk naming is that [oldestXid, nextFullXid) is
the appropriate check for values actually stored in tables. There could,
and probably should, be a narrower assertion ensuring that a xid is
protected against being pruned away (i.e. a PGPROC's xmin covering it).

The reason for being concerned enough in the new code to add the new
assertion helper (as well as a major motivating reason for making the
horizons 64 bit xids) is that it's much harder to ensure that "global
xmin" style horizons don't wrap around. By definition they include other
backends' ->xmin, and those can be released without a lock at any
time. As a lot of wraparound issues are triggered by very long-running
transactions, it is not even unlikely to hit such problems: At some
point somebody is going to kill that old backend and ->oldestXid will
advance very quickly.

There is a lot of code that is pretty unsafe around wraparounds... They
are getting easier and easier to hit on a regular schedule in production
(plenty of databases that hit wraparounds multiple times a week). And I
don't think we as PG developers often don't quite take that into
account.


Does that make some sense? Do you have a better suggestion for a name?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back,
with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+ * NB: can change any time unless locks held!
+ */

I'm going to add the helpful comment "NB: can change any time unless
locks held!" to every data structure in the backend that is in shared
memory and not immutable. No need, of course, to mention WHICH
locks...

On a related note, PROC_HDR really, really, really needs comments
explaining the locking regimen for the new xids field.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes
unused slots, but comments elsewhere don't seem to entirely agree.
Maybe the comments discussing how it is "dense" need to be a little
more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

- if (!suboverflowed)
+ if (suboverflowed)
+ continue;
+

Do we really need to do this kind of diddling in this patch? I mean
yes to the idea, but no to things that are going to make it harder to
understand what happened if this blows up.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

  /* ProcGlobal */
  size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK
to have sizing routines that don't use add_size() and mul_size(),
people are going to cargo cult that into places where there is more
risk of overflow than there is here.

You've got a bunch of different places that talk about the new PGXACT
array and they are somewhat redundant yet without saying exactly the
same thing every time either. I think that needs cleanup.

One thing I didn't see is any clear discussion of what happens if the
two copies of the XID information don't agree with each other. That
should be added someplace, either in an appropriate code comment or in
a README or something. I *think* both are protected by the same locks,
but there's also some unlocked access to those structure members, so
it's not entirely a slam dunk.

...Robert




Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
Hi,

Thanks for the review!


On 2020-04-07 12:41:07 -0400, Robert Haas wrote:
> In 0002, the comments in SnapshotSet() are virtually incomprehensible.
> There's no commit message so the reasons for the changes are unclear.
> But mostly looks unproblematic.

I was planning to drop that patch pre-commit, at least for now.  I think
there are a few live bugs here, but they're all older. I did send a few emails
about the class of problem; unfortunately it was a fairly one-sided
conversation so far ;)

https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de


> 0003 looks like a fairly unrelated bug fix that deserves to be
> discussed on the thread related to the original patch. Probably should
> be an open item.

There was some discussion in a separate thread:
https://www.postgresql.org/message-id/20200406025651.fpzdb5yyb7qyhqko%40alap3.anarazel.de
The only reason for including it in this patch stack is that I can't
really exercise the patchset without the fix (it's a bit sad that this
issue had gone unnoticed for months before I found it as part of the
development of this patch).

Think I'll push a minimal version now, and add an open item.


> 
> Regarding 0005:
> 
> There's sort of a mix of terminology here: are we pruning tuples or
> removing tuples or testing whether things are invisible? It would be
> better to be more consistent.
> 
> + * State for testing whether tuple versions may be removed. To improve
> + * GetSnapshotData() performance we don't compute an accurate value whenever
> + * acquiring a snapshot. Instead we compute boundaries above/below which we
> + * know that row versions are [not] needed anymore.  If at test time values
> + * falls in between the two, the boundaries can be recomputed (unless that
> + * just happened).
> 
> I don't like the wording here much. Maybe: State for testing whether
> an XID is invisible to all current snapshots. If an XID precedes
> maybe_needed_bound, it's definitely not visible to any current
> snapshot. If it equals or follows definitely_needed_bound, that XID
> isn't necessarily invisible to all snapshots. If it falls in between,
> we're not sure. If, when testing a particular tuple, we see an XID
> somewhere in the middle, we can try recomputing the boundaries to get
> a more accurate answer (unless we've just done that). This is cheaper
> than maintaining an accurate value all the time.

I'll incorporate that, thanks.


> There's also the problem that this sorta contradicts the comment for
> definitely_needed_bound. There it says intermediate values needed to
> be tested against the ProcArray, whereas here it says we need to
> recompute the bounds. That's kinda confusing.

For me those are the same. Computing an accurate bound is visiting the
procarray. But I'll rephrase.


> ComputedHorizons seems like a fairly generic name. I think there's
> some relationship between InvisibleToEveryoneState and
> ComputedHorizons that should be brought out more clearly by the naming
> and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons
much... But I find it hard to come up with something that's meaningfully
better.

I'll add a comment.


> + /*
> + * The value of ShmemVariableCache->latestCompletedFullXid when
> + * ComputeTransactionHorizons() held ProcArrayLock.
> + */
> + FullTransactionId latest_completed;
> +
> + /*
> + * The same for procArray->replication_slot_xmin and.
> + * procArray->replication_slot_catalog_xmin.
> + */
> + TransactionId slot_xmin;
> + TransactionId slot_catalog_xmin;
> 
> Department of randomly inconsistent names. In general I think it's
> quite hard to grasp the relationship between the different fields in
> ComputedHorizons.

What's the inconsistency? The dropped replication_ prefix vs. the dropped
FullXid postfix?



> +
> +bool
> +GinPageIsRecyclable(Page page)
> 
> Needs a comment. Or more than one.

Well, I started to write one a couple of times. But it's really just moving
the pre-existing code from the macro into a function, and there weren't
any comments around *any* of it before. All my comment attempts
basically just restated the code in so many words, or would have
required more work than seemed justified in the context of just moving
code.


> - /*
> - * If a transaction wrote a commit record in the gap between taking and
> - * logging the snapshot then latestCompletedXid may already be higher than
> - * the value from the snapshot, so check before we use the incoming value.
> - */
> - if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> -   running->latestCompletedXid))
> - ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
> -
> - Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
> -
> - LWLockRelease(ProcArrayLock);
> 
> This code got relocated so that the lock is released later, but you
> didn't add any comments explaining why. Somebody will move it back and
> then you'll yell at them for doing it wrong. 

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Robert Haas
Comments:

In 0002, the comments in SnapshotSet() are virtually incomprehensible.
There's no commit message so the reasons for the changes are unclear.
But mostly looks unproblematic.

0003 looks like a fairly unrelated bug fix that deserves to be
discussed on the thread related to the original patch. Probably should
be an open item.

0004 looks fine.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or
removing tuples or testing whether things are invisible? It would be
better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore.  If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether
an XID is invisible to all current snapshots. If an XID precedes
maybe_needed_bound, it's definitely not visible to any current
snapshot. If it equals or follows definitely_needed_bound, that XID
isn't necessarily invisible to all snapshots. If it falls in between,
we're not sure. If, when testing a particular tuple, we see an XID
somewhere in the middle, we can try recomputing the boundaries to get
a more accurate answer (unless we've just done that). This is cheaper
than maintaining an accurate value all the time.

There's also the problem that this sorta contradicts the comment for
definitely_needed_bound. There it says intermediate values needed to
be tested against the ProcArray, whereas here it says we need to
recompute the bounds. That's kinda confusing.
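
To make the intended three-way test concrete, a rough sketch (the struct
fields and the recompute helper are assumptions for illustration, not the
patch's actual identifiers):

static bool
XidIsInvisibleToAll(InvisibleToEveryoneState *state, TransactionId xid)
{
	/* older than the conservative bound: no current snapshot can see it */
	if (TransactionIdPrecedes(xid, state->maybe_needed_bound))
		return true;

	/* at/after the optimistic bound: assume some snapshot may still see it */
	if (TransactionIdFollowsOrEquals(xid, state->definitely_needed_bound))
		return false;

	/* in between: recompute accurate boundaries, then retest */
	RecomputeVisibilityBounds(state);		/* hypothetical helper */
	return TransactionIdPrecedes(xid, state->maybe_needed_bound);
}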

ComputedHorizons seems like a fairly generic name. I think there's
some relationship between InvisibleToEveryoneState and
ComputedHorizons that should be brought out more clearly by the naming
and the comments.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's
quite hard to grasp the relationship between the different fields in
ComputedHorizons.

+static inline bool OldSnapshotThresholdActive(void)
+{
+ return old_snapshot_threshold >= 0;
+}

Formatting.
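
Presumably meaning the usual layout for function definitions, i.e. roughly:

static inline bool
OldSnapshotThresholdActive(void)
{
	return old_snapshot_threshold >= 0;
}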

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-   running->latestCompletedXid))
- ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you
didn't add any comments explaining why. Somebody will move it back and
then you'll yell at them for doing it wrong. :-)

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+!TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer,
TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better
suggestion, either.
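
For what it's worth, a hedged sketch of how a caller might consume the
out-parameter (the horizon-test helper name is a placeholder, not the
patch's actual API):

static HTSV_Result
illustrative_prune_check(HeapTuple htup, Buffer buffer,
						 InvisibleToEveryoneState *vistest)
{
	TransactionId	dead_after = InvalidTransactionId;
	HTSV_Result		res;

	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);

	/*
	 * RECENTLY_DEAD here only means "deleter committed"; whether the tuple
	 * can actually be removed is decided by comparing dead_after against
	 * the caller's visibility horizon.
	 */
	if (res == HEAPTUPLE_RECENTLY_DEAD &&
		XidInvisibleToEveryone(vistest, dead_after))	/* placeholder */
		res = HEAPTUPLE_DEAD;

	return res;
}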

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

+ /*
+ * State related to determining whether a dead tuple is still needed.
+ */
+ InvisibleToEveryoneState *vistest;
+ TimestampTz limited_oldest_ts;
+ TransactionId limited_oldest_xmin;
+ /* have we made removal decision based on old_snapshot_threshold */
+ bool limited_oldest_committed;

Would benefit from more comments.

+ * accuring to prstate->vistest, but that can be removed based on

Typo.

Generally, heap_prune_satisfies_vacuum looks pretty good. The
limited_oldest_committed naming is confusing, but the comments make it
a lot clearer.

+ * If oldest btpo.xact in the deleted 

Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Jonathan S. Katz
On 4/7/20 8:15 AM, Andres Freund wrote:

> I think this is pretty close to being committable.
> 
> 
> But: This patch came in very late for v13, and it took me much longer to
> polish it up than I had hoped (partially distraction due to various bugs
> I found (in particular snapshot_too_old), partially covid19, partially
> "hell if I know"). The patchset touches core parts of the system. While
> both Thomas and David have done some review, they haven't for the latest
> version (mea culpa).
> 
> In many other instances I would say that the above suggests slipping to
> v14, given the timing.
> 
> The main reason I am considering pushing is that I think this patcheset
> addresses one of the most common critiques of postgres, as well as very
> common, hard to fix, real-world production issues. GetSnapshotData() has
> been a major bottleneck for about as long as I have been using postgres,
> and this addresses that to a significant degree.
> 
> A second reason I am considering it is that, in my opinion, the changes
> are not all that complicated and not even that large. At least not for a
> change to a problem that we've long tried to improve.

Even as recently as earlier this week, there was a blog post making the
rounds about the pain points of running PostgreSQL with many simultaneous
connections. Anything to help with that would go a long way, and looking
at the benchmarks you ran (at least with a quick, non-thorough glance),
this could and should be very positively impactful for a *lot* of
PostgreSQL users.

I can't comment on the "close to committable" aspect (at least not with
an informed, confident opinion), but if it is indeed close to committable,
you can put in the work to finish polishing (read: "bug fixing" :-), and
we have a plan both for testing and, if need be, for reverting, I would be
okay with including it, for whatever my vote is worth. Is the timing /
situation ideal? No, but the way you describe it, it sounds like there
is enough that can be done to ensure it's ready for Beta 1.

From an RMT standpoint, perhaps this is one of the "Recheck at Mid-Beta"
items, as well.

Thanks,

Jonathan





Re: Improving connection scalability: GetSnapshotData()

2020-04-07 Thread Andres Freund
On 2020-04-07 05:15:03 -0700, Andres Freund wrote:
> Attached is a substantially polished version of my patches. Note that
> the first three patches, as well as the last, are not intended to be
> committed at this time / in this form - they're there to make testing
> easier.

I didn't actually attach that last not-to-be-committed patch... It's
just the pgbench patch that I had attached before (and started a thread
about). Here it is again.
From 59a9a03da728d53364f9c3d6fe8b48e21697b93e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 12/12] WIP: pgbench

---
 src/bin/pgbench/pgbench.c | 107 +-
 1 file changed, 83 insertions(+), 24 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e99af801675..21d1ab2aac1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -310,6 +310,10 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+#ifdef ENABLE_THREAD_SAFETY
+pthread_barrier_t conn_barrier;
+#endif
+
 /*
  * Connection state machine states.
  */
@@ -5206,6 +5210,10 @@ printResults(StatsData *total, instr_time total_time,
 	tps_exclude = ntx /
 		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
 
+	//fprintf(stderr, "time: include: %f, exclude: %f, conn total: %f\n",
+	//		time_include, time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients),
+	//		INSTR_TIME_GET_DOUBLE(conn_total_time));
+
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -6126,26 +6134,14 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
-
-	/* set alarm if duration is specified. */
-	if (duration > 0)
-		setalarm(duration);
-
 	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+	pthread_barrier_init(&conn_barrier, NULL, nthreads);
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
@@ -6162,13 +6158,38 @@ main(int argc, char **argv)
 			thread->thread = INVALID_THREAD;
 		}
 	}
-#else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
-	/* compute when to stop */
+#endif			/* ENABLE_THREAD_SAFETY */
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* wait till all threads started (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: waiting for thread start: %u\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
+#endif			/* ENABLE_THREAD_SAFETY */
+
+	/* get start up time */
+	INSTR_TIME_SET_CURRENT(start_time);
+
+	/* */
+	for (i = 0; i < nthreads; i++)
+	{
 		TState	   *thread = &threads[i];
+
+		thread->start_time = start_time;
+
+		/* compute when to stop */
+		if (duration > 0)
+			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
+				(int64) 1000000 * duration;
+	}
+
+	/* set alarm if duration is specified. */
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
+		setalarm(duration);
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* updated start time (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: %u: waiting for start time\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
 #endif			/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6236,12 +6257,30 @@ threadRun(void *arg)
 	int			i;
 
 	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start;
+	int64		last_report;
+	int64		next_report;
 	StatsData	last,
 aggs;
 
+	/* wait till all threads started (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for thread start\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	/* wait for start time to be initialized (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for start time\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	last_report = thread_start;
+	next_report = last_report + (int64) progress * 1000000;
+
 	/*
 	 * Initialize throttling rate target for all of the thread's clients.  It
 	 * might be a little more accurate to reset thread->start_time here too.
@@ -6288,7 +6327,27 @@ threadRun(void *arg)
 
 	/* time after thread and connections set up */
 	INSTR_TIME_SET_CURRENT(thread->conn_time);
-	

Re: Improving connection scalability: GetSnapshotData()

2020-04-06 Thread Andres Freund
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:
> > 3) Plain pgbench read-write (you already did it for sure)
> 
> -s 100 -M prepared -T 700
> 
> autovacuum=off, fsync on:
> clients tps master  tps pgxact
> 1   474 479
> 16  4356    4476
> 40  8591    9309
> 198 20045   20261
> 1024    17986   18545
> 
> autovacuum=off, fsync off:
> clients tps master  tps pgxact
> 1   7828    7719
> 16  49069   50482
> 40  68241   73081
> 198 73464   77801
> 1024    25621   28410
> 
> I chose autovacuum off because otherwise the results vary much more
> widely, and autovacuum isn't really needed for the workload.

> These numbers seem pretty decent to me. The regressions seem mostly
> within noise. The one possible exception to that is plain pgbench
> read/write with fsync=off and only a single session. I'll run more
> benchmarks around that tomorrow (but now it's 6am :().

The "one possible exception" turned out to be a "real" regression, but
one that was dead easy to fix: It was an DEBUG1 elog I had left in. The
overhead seems to solely have been the increased code size + overhead of
errstart().  After that there's no difference in the single client case
anymore (I'd not expect a benefit).

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-06 Thread Andres Freund
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:
> These benchmarks are on my workstation. The larger VM I used in the last
> round wasn't currently available.

One way to reproduce the problem at smaller connection counts / on smaller
machines is to take more snapshots. That doesn't fully reproduce the problem,
because resetting ->xmin without transaction overhead is part of it, but it's
helpful.

I use a volatile function that loops over a trivial statement. There's
probably an easier / more extreme way to reproduce the problem. But it's
good enough.

-- setup
CREATE OR REPLACE FUNCTION snapme(p_ret int, p_loop int) RETURNS int VOLATILE 
LANGUAGE plpgsql AS $$BEGIN FOR x in 1..p_loop LOOP EXECUTE 'SELECT 1';END 
LOOP; RETURN p_ret; END;$$;
-- statement executed in parallel
SELECT snapme(17, 1);

before (all above 1.5%):
+   37.82%  postgres  postgres  [.] GetSnapshotData
+6.26%  postgres  postgres  [.] AllocSetAlloc
+3.77%  postgres  postgres  [.] base_yyparse
+3.04%  postgres  postgres  [.] core_yylex
+1.94%  postgres  postgres  [.] grouping_planner
+1.83%  postgres  libc-2.30.so  [.] __strncpy_avx2
+1.80%  postgres  postgres  [.] palloc
+1.73%  postgres  libc-2.30.so  [.] __memset_avx2_unaligned_erms

after:
+5.75%  postgres  postgres  [.] base_yyparse
+4.37%  postgres  postgres  [.] palloc
+4.29%  postgres  postgres  [.] AllocSetAlloc
+3.75%  postgres  postgres  [.] expression_tree_walker.part.0
+3.14%  postgres  postgres  [.] core_yylex
+2.51%  postgres  postgres  [.] subquery_planner
+2.48%  postgres  postgres  [.] CheckExprStillValid
+2.45%  postgres  postgres  [.] check_stack_depth
+2.42%  postgres  plpgsql.so[.] exec_stmt
+1.92%  postgres  libc-2.30.so  [.] __memset_avx2_unaligned_erms
+1.91%  postgres  postgres  [.] query_tree_walker
+1.88%  postgres  libc-2.30.so  [.] __GI_strtoll_l_internal
+1.86%  postgres  postgres  [.] _SPI_execute_plan
+1.85%  postgres  postgres  [.] assign_query_collations_walker
+1.84%  postgres  postgres  [.] remove_useless_results_recurse
+1.83%  postgres  postgres  [.] grouping_planner
+1.50%  postgres  postgres  [.] set_plan_refs


If I change the workload to be
BEGIN;
SELECT txid_current();
SELECT snapme(17, 1000);
COMMIT;


the difference shrinks (because GetSnapshotData() only needs to look at
procs with xids, and xids are assigned for much longer), but it is still
significant:

before (all above 1.5%):
+   35.89%  postgres  postgres[.] GetSnapshotData
+7.94%  postgres  postgres[.] AllocSetAlloc
+4.42%  postgres  postgres[.] base_yyparse
+3.62%  postgres  libc-2.30.so[.] __memset_avx2_unaligned_erms
+2.87%  postgres  postgres[.] LWLockAcquire
+2.76%  postgres  postgres[.] core_yylex
+2.30%  postgres  postgres[.] expression_tree_walker.part.0
+1.81%  postgres  postgres[.] MemoryContextAllocZeroAligned
+1.80%  postgres  postgres[.] transformStmt
+1.66%  postgres  postgres[.] grouping_planner
+1.64%  postgres  postgres[.] subquery_planner

after:
+   24.59%  postgres  postgres  [.] GetSnapshotData
+4.89%  postgres  postgres  [.] base_yyparse
+4.59%  postgres  postgres  [.] AllocSetAlloc
+3.00%  postgres  postgres  [.] LWLockAcquire
+2.76%  postgres  postgres  [.] palloc
+2.27%  postgres  postgres  [.] MemoryContextAllocZeroAligned
+2.26%  postgres  postgres  [.] check_stack_depth
+1.77%  postgres  postgres  [.] core_yylex

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-04-06 Thread Andres Freund
Hi,

These benchmarks are on my workstation. The larger VM I used in the last
round wasn't currently available.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB Ram.
data directory is on a Samsung SSD 970 PRO 1TB

A bunch of terminals, emacs, mutt are open while the benchmark is
running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks


Independent of the effects of this patch (i.e. including master) I had a
fairly hard time getting reproducible numbers for *low* client cases. I
found the numbers to be more reproducible if I pinned server/pgbench
onto the same core :(.  I chose to do that for the -c1 cases, to
benchmark the optimal behaviour, as that seemed to have the biggest
potential for regressions.

All numbers are best of three. Each test starts in a freshly created
cluster.


On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:
> Following pgbench scripts comes first to my mind:
> 1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)

clients tps master  tps pgxact
1   46118   46027
16  377357  440233
40  373304  410142
198 103912  105579

btw, there's some pretty horrible cacheline bouncing in txid_current(),
because backends first call ReadNextFullTransactionId() (acquires XidGenLock
in shared mode, reads ShmemVariableCache->nextFullXid), and then separately
cause GetNewTransactionId() (acquires XidGenLock exclusively, reads &
writes nextFullXid).

With fsync=off (and also with synchronous_commit=off) the numbers are, at
lower client counts, severely depressed and variable due to walwriter going
completely nuts (using more CPU than the backend doing the queries). Because
WAL writes are so fast on my storage, individual XLogBackgroundFlush() calls
are very quick. This leads to a *lot* of kill()s from the backend, from
within XLogSetAsyncXactLSN().  There's got to be a bug here.  But it's
unrelated.

> 2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 
int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360

fsync on:
clients tps master  tps pgxact
1   653 658
16  5687    5668
40  14212   14229
198 60483   62420

fsync off:
clients tps master  tps pgxact
1   59356   59891
16  290626  21
40  348210  355669
198 289182  291529

clients tps master  tps pgxact
1024    47586   52135

-M simple
fsync off:
clients tps master  tps pgxact
40  289077  326699
198 286011  299928




> 3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients tps master  tps pgxact
1   474 479
16  4356    4476
40  8591    9309
198 20045   20261
1024    17986   18545

autovacuum=off, fsync off:
clients tps master  tps pgxact
1   7828    7719
16  49069   50482
40  68241   73081
198 73464   77801
1024    25621   28410

I chose autovacuum off because otherwise the results vary much more
widely, and autovacuum isn't really needed for the workload.



> 4) pgbench read-write script with increased amount of SELECTs.  Repeat
> select from pgbench_accounts say 10 times with different aids.

I interspersed all server-side statements in the script with two
selects of other pgbench_accounts rows each.

-s 100 -M prepared -T 700
autovacuum=off, fsync on:
clients tps master  tps pgxact
1   365 367
198 20065   21391

-s 1000 -M prepared -T 700
autovacuum=off, fsync on:
clients tps master  tps pgxact
16  2757    2880
40  4734    4996
198 16950   19998
1024    22423   24935


> 5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients tps master  tps pgxact
16  37289   38656
40  81284   81260
198 189002  189357
1024    143986  164762


> > That definitely needs to be measured, due to the locking changes around 
> > procarrayaddd/remove.
> >
> > I don't think regressions besides perhaps 2pc are likely - there's nothing 
> > really getting more expensive but procarray add/remove.
>
> I agree that ProcArrayAdd()/Remove() should be first subject of
> 

Re: Improving connection scalability: GetSnapshotData()

2020-04-05 Thread David Rowley
On Sun, 1 Mar 2020 at 21:47, Andres Freund  wrote:
> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master      tps pgxact-split
> >
> > 1   26842.492845    26524.194821
> > 10  246923.158682   249224.782661
> > 50  695956.539704   709833.746374
> > 100 1054727.043139  1903616.306028
> > 200 964795.282957   1949200.338012
> > 300 906029.377539   1927881.231478
> > 400 845696.690912   1911065.369776
> > 500 812295.222497   1926237.255856
> > 600 888030.104213   1903047.236273
> > 700 866896.532490   1886537.202142
> > 800 863407.341506   1883768.592610
> > 900 871386.608563   1874638.012128
> > 1000    887668.277133   1876402.391502
> > 1500    860051.361395   1815103.564241
> > 2000    890900.098657   1775435.271018
> > 3000    874184.980039   1653953.817997
> > 4000    845023.080703   1582582.316043
> > 5000    817100.195728   1512260.802371
> >
> > I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

#benchmark
#!/bin/bash

for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000
3000 4000 5000;
do
echo Testing with $i connections >> bench.log
pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench. I got some pretty strange
results with the unpatched version: up to about 50 million tps (excluding
connection establishment), which seems pretty far-fetched.

connections  Unpatched    Patched
1            49062.24413  49834.64983
10           428673.1027  453290.5985
50           1552413.084  1849233.821
100          2039675.027  2261437.1
200          3139648.845  3632008.991
300          3091248.316  3597748.942
400          3056453.5    3567888.293
500          3019571.47   3574009.053
600          2991052.393  3537518.903
700          2952484.763  3553252.603
800          2910976.875  3539404.865
900          2873929.989  3514353.776
1000         2846859.499  3490006.026
1500         2540003.038  3370093.934
2000         2361799.107  3197556.738
3000         2056973.778  2949740.692
4000         1751418.117  2627174.81
5000         1464786.461  2334586.042

> Attached as a graph as well.

Likewise.

David


Re: Improving connection scalability: GetSnapshotData()

2020-03-31 Thread Andres Freund
Hi,

On 2020-03-31 13:04:38 -0700, Andres Freund wrote:
> I'm still fighting with snapshot_too_old. The feature is just badly
> undertested, underdocumented, and there are lots of other oddities. I've
> now spent about as much time on that feature as on the whole rest of
> the patchset.

To expand on this being under-tested: The whole time mapping
infrastructure is not tested, because all of that is bypassed when
old_snapshot_threshold = 0.  And old_snapshot_threshold = 0 basically
only exists for testing.  The largest parts of the complexity of this
feature are TransactionIdLimitedForOldSnapshots() and
MaintainOldSnapshotTimeMapping() - and none of that complexity is tested,
because the tests run with old_snapshot_threshold = 0.

So we have test-only infrastructure that doesn't allow us to actually test
the feature.


And the tests that we do have don't have a single comment explaining
what the expected results are. Except for the newer
sto_using_hash_index.spec, they just run all permutations. I don't know
how those tests actually help, since it's not clear why any of the
results are the way they are, which of them are just the results of
bugs, or which aren't affected by s_t_o at all.

Greetings,

Andres Freund



