Re: Improving connection scalability: GetSnapshotData()
On 27.02.2021 20:40, AJG wrote: Hi, Greatly appreciate it if you could please reply to the following questions as time allows. I have seen previous discussion/patches on a built-in connection pooler. How does this scalability improvement, particularly the idle connection improvements etc., affect the need for that built-in pooler, if any? Same general question about an external connection pooler in general in production: still required to route to different servers, but no longer needed for the pooling part, as an example. Many Thanks! A connection pooler is still needed. The patch for GetSnapshotData() mostly improves the scalability of read-only queries and reduces contention on the ProcArray lock. But a read-write workload causes contention on many other resources: the relation extension lock, buffer locks, tuple locks and so on. If you run pgbench on a NUMA machine with hundreds of cores, you will still observe significant degradation of performance with an increasing number of connections. And this degradation will be dramatic if you use a non-uniform distribution of keys, for example a Zipfian distribution. -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
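The point about Zipfian workloads can be illustrated outside PostgreSQL: under a Zipf-like key distribution most accesses pile onto a handful of hot keys, so all sessions keep fighting over the same tuple and buffer locks regardless of how cheap snapshots become. A minimal Python sketch of that skew (illustrative only, not part of the patch):

```python
import random
from collections import Counter

def zipf_draws(n_keys, n_draws, s=1.5, seed=0):
    """Sample keys 1..n_keys with probability proportional to 1/k**s."""
    rng = random.Random(seed)
    keys = range(1, n_keys + 1)
    weights = [1.0 / k ** s for k in keys]
    return Counter(rng.choices(keys, weights=weights, k=n_draws))

counts = zipf_draws(n_keys=1000, n_draws=10_000)
hot_key, hot_count = counts.most_common(1)[0]
# The hottest key soaks up a large share of all accesses, so every
# session keeps hitting the same rows, buffers and locks.
print(hot_key, hot_count / 10_000)
```

A uniform distribution would give each key roughly 0.1% of the accesses here; the Zipfian hot key gets orders of magnitude more, which is why the degradation Knizhnik describes is so much worse than with pgbench's default uniform access.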
Re: Improving connection scalability: GetSnapshotData()
- Original Message - > From: "AJG" > To: "Pg Hackers" > Sent: Saturday, 27 February 2021 14:40:58 > Subject: Re: Improving connection scalability: GetSnapshotData() > Hi, > Greatly appreciate if you could please reply to the following questions as > time allows. > I have seen previous discussion/patches on a built-in connection pooler. How > does this scalability improvement, particularly idle connection improvements > etc, affect that built-in pooler need, if any? > Same general question about an external connection pooler in general in > Production? Still required to route to different servers but no longer > needed for the pooling part, as an example. > Many Thanks! > -- > Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html As I understand it, the improvements made to GetSnapshotData() mean a higher connection count does not incur as much of a penalty to performance as before. I am not sure it solves the connection establishment side of things, but I may be wrong. Luis R. Weck
Re: Improving connection scalability: GetSnapshotData()
Hi, Greatly appreciate it if you could please reply to the following questions as time allows. I have seen previous discussion/patches on a built-in connection pooler. How does this scalability improvement, particularly the idle connection improvements etc., affect the need for that built-in pooler, if any? Same general question about an external connection pooler in general in production: still required to route to different servers but no longer needed for the pooling part, as an example. Many Thanks! -- Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Re: Improving connection scalability: GetSnapshotData()
On 10/5/20 10:33 PM, Andres Freund wrote: > Hi, > > On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote: >> On 10/1/20 4:22 PM, Andres Freund wrote: >>> # Note: on Windows, IPC::Run seems to convert \r\n to \n in program >>> output >>> # if we're using native Perl, but not if we're using MSys Perl. So do >>> it >>> # by hand in the latter case, here and elsewhere. >>> that IPC::Run converts things, but that native windows perl uses >>> https://perldoc.perl.org/perlrun#PERLIO >>> a PERLIO that includes :crlf, whereas msys probably doesn't? >>> >>> Any chance you could run something like >>> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' >>> on both native and msys perl? >>> >>> possibly also for stderr, just to make it more futureproof, and at the top of the file: use Config; Do you want me to test that first? >>> That'd be awesome. >> The change I suggested makes jacana happy. > Thanks, pushed. Hopefully that fixes the mingw animals. I don't think we're out of the woods yet. This test is also having bad effects on bowerbird, which is an MSVC animal. It's hanging completely :-( Digging some more. cheers andrew -- Andrew Dunstan EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote: > On 10/1/20 4:22 PM, Andres Freund wrote: > > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > > output > > # if we're using native Perl, but not if we're using MSys Perl. So do > > it > > # by hand in the latter case, here and elsewhere. > > that IPC::Run converts things, but that native windows perl uses > > https://perldoc.perl.org/perlrun#PERLIO > > a PERLIO that includes :crlf, whereas msys probably doesn't? > > > > Any chance you could run something like > > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > > on both native and msys perl? > > > > > >> possibly also for stderr, just to make it more futureproof, and at the > >> top of the file: > >> > >> use Config; > >> > >> > >> Do you want me to test that first? > > That'd be awesome. > The change I suggested makes jacana happy. Thanks, pushed. Hopefully that fixes the mingw animals. I wonder if we instead should do something like

# Have mingw perl treat CRLF the same way as perl on native windows does
ifeq ($(build_os),mingw32)
PROVE="PERLIO=unixcrlf $(PROVE)"
endif

in Makefile.global.in? Then we could remove these regexes from all the various places? Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 2020/10/02 3:26, Andres Freund wrote: Hi Ian, Andrew, All, On 2020-09-30 15:43:17 -0700, Andres Freund wrote: Attached is an updated version of the test (better utility function, stricter regexes, bailing out instead of failing just the current test when psql times out). I'm leaving it in this test for now, but it's fairly easy to use this way, in my opinion, so it may be worth moving to PostgresNode at some point. I pushed this yesterday. Ian, thanks again for finding this and helping with fixing & testing. Thanks! Apologies for not getting back to your earlier responses, have been distracted by Various Other Things. The tests I run which originally triggered the issue now run just fine. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 4:22 PM, Andres Freund wrote: > Hi, > > On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: >> My strong suspicion is that we're getting unwanted CRs. Note the >> presence of numerous instances of this in PostgresNode.pm: > >> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; >> >> So you probably want something along those lines at the top of the loop >> in send_query_and_wait: >> >> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > Yikes, that's ugly :(. > > > I assume it's not, as the comment says > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > output > # if we're using native Perl, but not if we're using MSys Perl. So do > it > # by hand in the latter case, here and elsewhere. > that IPC::Run converts things, but that native windows perl uses > https://perldoc.perl.org/perlrun#PERLIO > a PERLIO that includes :crlf, whereas msys probably doesn't? > > Any chance you could run something like > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > on both native and msys perl? > > >> possibly also for stderr, just to make it more futureproof, and at the >> top of the file: >> >> use Config; >> >> >> Do you want me to test that first? > That'd be awesome. > > > The change I suggested makes jacana happy. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-10-01 16:44:03 -0400, Andrew Dunstan wrote: > > I assume it's not, as the comment says > > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > > output > > # if we're using native Perl, but not if we're using MSys Perl. So do > > it > > # by hand in the latter case, here and elsewhere. > > that IPC::Run converts things, but that native windows perl uses > > https://perldoc.perl.org/perlrun#PERLIO > > a PERLIO that includes :crlf, whereas msys probably doesn't? > > > > Any chance you could run something like > > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > > on both native and msys perl? > > msys (jacana): stdio > > native: unixcrlf Interesting. That suggests we could get around needing the if-msys branches in several places by setting PERLIO to unixcrlf somewhere centrally when using msys. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 4:22 PM, Andres Freund wrote: > Hi, > > On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: >> My strong suspicion is that we're getting unwanted CRs. Note the >> presence of numerous instances of this in PostgresNode.pm: > >> $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; >> >> So you probably want something along those lines at the top of the loop >> in send_query_and_wait: >> >> $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > Yikes, that's ugly :(. > > > I assume it's not, as the comment says > # Note: on Windows, IPC::Run seems to convert \r\n to \n in program > output > # if we're using native Perl, but not if we're using MSys Perl. So do > it > # by hand in the latter case, here and elsewhere. > that IPC::Run converts things, but that native windows perl uses > https://perldoc.perl.org/perlrun#PERLIO > a PERLIO that includes :crlf, whereas msys probably doesn't? > > Any chance you could run something like > perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' > on both native and msys perl? > > msys (jacana): stdio native: unixcrlf cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-10-01 16:00:20 -0400, Andrew Dunstan wrote: > My strong suspicion is that we're getting unwanted CRs. Note the > presence of numerous instances of this in PostgresNode.pm: > $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; > > So you probably want something along those lines at the top of the loop > in send_query_and_wait: > > $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; Yikes, that's ugly :(. I assume it's not, as the comment says # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output # if we're using native Perl, but not if we're using MSys Perl. So do it # by hand in the latter case, here and elsewhere. that IPC::Run converts things, but that native windows perl uses https://perldoc.perl.org/perlrun#PERLIO a PERLIO that includes :crlf, whereas msys probably doesn't? Any chance you could run something like perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");' on both native and msys perl? > possibly also for stderr, just to make it more futureproof, and at the > top of the file: > > use Config; > > > Do you want me to test that first? That'd be awesome. > The difference between the canonical way perl states the regex is due to > perl version differences. It shouldn't matter. Thanks! Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 10/1/20 2:26 PM, Andres Freund wrote: > Hi Ian, Andrew, All, > > On 2020-09-30 15:43:17 -0700, Andres Freund wrote: >> Attached is an updated version of the test (better utility function, >> stricter regexes, bailing out instead of failing just the current test when >> psql times out). I'm leaving it in this test for now, but it's fairly >> easy to use this way, in my opinion, so it may be worth moving to >> PostgresNode at some point. > I pushed this yesterday. Ian, thanks again for finding this and helping > with fixing & testing. > > > Unfortunately currently some buildfarm animals don't like the test for > reasons I don't quite understand. Looks like it's all windows + msys > animals that run the tap tests. On jacana and fairywren the new test > fails, but with a somewhat confusing problem: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34 > Bail out! aborting wait: program timed out > # stream contents: >>data > # (0 rows) > # << > # pattern searched for: (?m-xis:^\\(0 rows\\)$) > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13 > Bail out! aborting wait: program timed out > stream contents: >>data > (0 rows) > << > pattern searched for: (?^m:^\\(0 rows\\)$) > > I don't know what the -xis indicates on jacana, and why it's not present > on fairywren. Nor do I know why the pattern doesn't match in the first > place, sure looks like it should? > > Andrew, do you have an insight into how mingw's regex match differs > from native windows and proper unixoid systems? I guess it's somewhere > around line endings or such? > > Jacana successfully deals with 013_crash_restart.pl, which does use the > same mechanism as the new 021_row_visibility.pl - I think the only real > difference is that I used ^ and $ in the regexes in the latter... My strong suspicion is that we're getting unwanted CRs. Note the presence of numerous instances of this in PostgresNode.pm: $stdout =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; So you probably want something along those lines at the top of the loop in send_query_and_wait: $$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys'; possibly also for stderr, just to make it more futureproof, and at the top of the file: use Config; Do you want me to test that first? The difference between the canonical way perl states the regex is due to perl version differences. It shouldn't matter. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi Ian, Andrew, All, On 2020-09-30 15:43:17 -0700, Andres Freund wrote: > Attached is an updated version of the test (better utility function, > stricter regexes, bailing out instead of failing just the current test when > psql times out). I'm leaving it in this test for now, but it's fairly > easy to use this way, in my opinion, so it may be worth moving to > PostgresNode at some point. I pushed this yesterday. Ian, thanks again for finding this and helping with fixing & testing. Unfortunately currently some buildfarm animals don't like the test for reasons I don't quite understand. Looks like it's all windows + msys animals that run the tap tests. On jacana and fairywren the new test fails, but with a somewhat confusing problem: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-10-01%2015%3A32%3A34 Bail out! aborting wait: program timed out # stream contents: >>data # (0 rows) # << # pattern searched for: (?m-xis:^\\(0 rows\\)$) https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-01%2014%3A12%3A13 Bail out! aborting wait: program timed out stream contents: >>data (0 rows) << pattern searched for: (?^m:^\\(0 rows\\)$) I don't know what the -xis indicates on jacana, and why it's not present on fairywren. Nor do I know why the pattern doesn't match in the first place, sure looks like it should? Andrew, do you have an insight into how mingw's regex match differs from native windows and proper unixoid systems? I guess it's somewhere around line endings or such? Jacana successfully deals with 013_crash_restart.pl, which does use the same mechanism as the new 021_row_visibility.pl - I think the only real difference is that I used ^ and $ in the regexes in the latter... Greetings, Andres Freund
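The symptom described above — the stream visibly contains "(0 rows)" yet the multiline pattern `^\(0 rows\)$` does not match — is exactly what a stray carriage return produces: in multiline mode `$` matches just before the `\n`, and an intervening `\r` breaks the match. A quick Python illustration of the same regex semantics (the strings are hypothetical stand-ins for the test output, but Perl behaves the same way here):

```python
import re

pattern = re.compile(r'^\(0 rows\)$', re.MULTILINE)

msys_output = 'data\r\n(0 rows)\r\n'   # CRs survive when the perl I/O layers don't strip them
unix_output = 'data\n(0 rows)\n'       # after a $stdout =~ s/\r\n/\n/g normalization

# With a CR between ')' and the newline, '$' cannot match after ')'.
print(bool(pattern.search(msys_output)))
print(bool(pattern.search(unix_output)))
```

This also explains why 013_crash_restart.pl kept passing: its regexes did not anchor with `^` and `$`, so a trailing `\r` was harmless there.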
Re: Improving connection scalability: GetSnapshotData()
On Tue, 15 Sep 2020 at 07:17, Andres Freund wrote: > > Hi, > > On 2020-09-09 17:02:58 +0900, Ian Barwick wrote: > > Attached, though bear in mind I'm not very familiar with parts of this, > > particularly 2PC stuff, so consider it educated guesswork. > > Thanks for this, and the test case! > > Your commit fixes the issues, but not quite correctly. Multixacts > shouldn't matter, so we don't need to do anything there. And for the > increases, I think they should be inside the already existing > ProcArrayLock acquisition, as in the attached. > > > I've also included a quite heavily revised version of your test. Instead > of using dblink I went for having a long-running psql that I feed over > stdin. The main reason for not liking the previous version is that it > seems fragile, with the sleep and everything. I expanded it to cover > 2PC as well. > > The test probably needs a bit of cleanup, wrapping some of the > redundancy around the pump_until calls. > > I think the approach of having a long running psql session is really > useful, and probably would speed up some tests. Does anybody have a good > idea for how to best, and without undue effort, to integrate this into > PostgresNode.pm? I don't really have a great idea, so I think I'd leave > it with a local helper in the new test? 2ndQ has some infra for that and various other TAP enhancements that I'd like to try to upstream. I'll ask what I can share and how. -- Craig Ringer http://www.2ndQuadrant.com/ 2ndQuadrant - PostgreSQL Solutions for the Enterprise
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-14 16:17:18 -0700, Andres Freund wrote: > I've also included a quite heavily revised version of your test. Instead > of using dblink I went for having a long-running psql that I feed over > stdin. The main reason for not liking the previous version is that it > seems fragile, with the sleep and everything. I expanded it to cover > 2PC as well. > > The test probably needs a bit of cleanup, wrapping some of the > redundancy around the pump_until calls. > > I think the approach of having a long running psql session is really > useful, and probably would speed up some tests. Does anybody have a good > idea for how to best, and without undue effort, to integrate this into > PostgresNode.pm? I don't really have a great idea, so I think I'd leave > it with a local helper in the new test? Attached is an updated version of the test (better utility function, stricter regexes, bailing out instead of failing just the current test when psql times out). I'm leaving it in this test for now, but it's fairly easy to use this way, in my opinion, so it may be worth moving to PostgresNode at some point. Greetings, Andres Freund

From 2ca4fa9de369aeba0d6386ec7d749cb366259728 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v3] Fix and test snapshot behaviour on standby.

I (Andres) broke this in 623a9ba79bb, because I didn't think about the way snapshots are built on standbys sufficiently. Unfortunately our existing tests did not catch this, as they are all just querying with psql (therefore ending up with fresh snapshots). The fix is trivial, we just need to increment the completion counter in ExpireTreeKnownAssignedTransactionIds(), which is the equivalent of ProcArrayEndTransaction() during recovery. This commit also adds a new test doing some basic testing of the correctness of snapshots built on standbys.
To avoid the aforementioned issue of one-shot psqls not exercising the snapshot caching, the test uses long-lived psql sessions, similar to 013_crash_restart.pl. It'd be good to extend the test further.

Reported-By: Ian Barwick
Author: Andres Freund
Author: Ian Barwick
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb...@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c       |   3 +
 src/test/recovery/t/021_row_visibility.pl | 192 ++
 2 files changed, 195 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 5aaeb6e2b55..07c5eeb7495 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 000..95516b05d01
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,192 @@
+# Checks that snapshots on standbys behave in a minimally reasonable
+# way.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(30);
+
+# One psql to primary and standby each, for all queries. That allows
+# to check uncommitted changes being replicated and such.
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} =
+  IPC::Run::start(
+	['psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres')],
+	'<', \$psql_primary{stdin},
+	'>', \$psql_primary{stdout},
+	'2>', \$psql_primary{stderr},
+	$psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} =
+  IPC::Run::start(
+	['psql', '-XA', '-f', '-', '-d',
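The pump_until helper the test relies on boils down to: feed a long-running child process over stdin and accumulate its stdout until a regex matches or a timer expires. A rough Python analogue of that idea (using a trivial echo child as a stand-in for psql; names and details are illustrative, not the actual TAP helper):

```python
import re
import subprocess
import sys
import time

def pump_until(proc, pattern, timeout=5.0):
    """Collect proc's stdout until `pattern` matches the accumulated output."""
    buf = ''
    compiled = re.compile(pattern, re.MULTILINE)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        line = proc.stdout.readline()  # demo simplification: blocks until a line arrives
        if not line:
            break
        buf += line
        if compiled.search(buf):
            return buf
    raise TimeoutError('pattern %r not found in %r' % (pattern, buf))

# Stand-in for the long-running psql: a child that echoes stdin line by line.
child = subprocess.Popen(
    [sys.executable, '-u', '-c',
     'import sys\nfor line in sys.stdin:\n    print(line, end="")'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

child.stdin.write('SELECT * FROM test_visibility;\n(0 rows)\n')
child.stdin.flush()
out = pump_until(child, r'^\(0 rows\)$')
child.stdin.close()
child.wait()
```

Keeping one session alive across queries is what makes the bug reproducible at all: a fresh psql per query always builds a fresh snapshot, so the stale-cache path is never exercised.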
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-15 11:56:24 +0900, Michael Paquier wrote: > On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote: > > My test uses IPC::Run - although I'm indirectly 'use'ing it, which I guess > > isn't pretty. Just as 013_crash_restart.pl already did (even before > > psql/t/010_tab_completion.pl). I am mostly wondering whether we could > > avoid copying the utility functions into multiple test files... > > > > Does IO::Pty work on windows? Given that currently the test doesn't use > > a pty and that there's no benefit I can see in requiring one, I'm a bit > > hesitant to go there? > > Per https://metacpan.org/pod/IO::Tty: > "Windows is now supported, but ONLY under the Cygwin environment, see > http://sources.redhat.com/cygwin/." > > So I would suggest to make stuff a soft dependency (as Tom is > hinting?), and not worry about Windows specifically. It is not like > what we are dealing with here is specific to Windows anyway, so you > would have already sufficient coverage. I would not mind if any > refactoring is done later, once we know that the proposed test is > stable in the buildfarm as we would get a better image of what part of > the facility overlaps across multiple tests. I'm confused - the test as posted should work on windows, and we already do this in an existing test (src/test/recovery/t/013_crash_restart.pl). What's the point in adding a platform-specific dependency here? Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On Mon, Sep 14, 2020 at 05:42:51PM -0700, Andres Freund wrote: > My test uses IPC::Run - although I'm indirectly 'use'ing it, which I guess > isn't pretty. Just as 013_crash_restart.pl already did (even before > psql/t/010_tab_completion.pl). I am mostly wondering whether we could > avoid copying the utility functions into multiple test files... > > Does IO::Pty work on windows? Given that currently the test doesn't use > a pty and that there's no benefit I can see in requiring one, I'm a bit > hesitant to go there? Per https://metacpan.org/pod/IO::Tty: "Windows is now supported, but ONLY under the Cygwin environment, see http://sources.redhat.com/cygwin/." So I would suggest to make stuff a soft dependency (as Tom is hinting?), and not worry about Windows specifically. It is not like what we are dealing with here is specific to Windows anyway, so you would have already sufficient coverage. I would not mind if any refactoring is done later, once we know that the proposed test is stable in the buildfarm as we would get a better image of what part of the facility overlaps across multiple tests. -- Michael
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-14 20:14:48 -0400, Tom Lane wrote: > Andres Freund writes: > > I think the approach of having a long running psql session is really > > useful, and probably would speed up some tests. Does anybody have a good > > idea for how to best, and without undue effort, to integrate this into > > PostgresNode.pm? I don't really have a great idea, so I think I'd leave > > it with a local helper in the new test? > > You could use the interactive_psql infrastructure that already exists > for psql/t/010_tab_completion.pl. That does rely on IO::Pty, but > I think I'd prefer to accept that dependency for such tests over rolling > our own IPC::Run, which is more or less what you've done here. My test uses IPC::Run - although I'm indirectly 'use'ing, which I guess isn't pretty. Just as 013_crash_restart.pl already did (even before psql/t/010_tab_completion.pl). I am mostly wondering whether we could avoid copying the utility functions into multiple test files... Does IO::Pty work on windows? Given that currently the test doesn't use a pty and that there's no benefit I can see in requiring one, I'm a bit hesitant to go there? Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Andres Freund writes: > I think the approach of having a long running psql session is really > useful, and probably would speed up some tests. Does anybody have a good > idea for how to best, and without undue effort, to integrate this into > PostgresNode.pm? I don't really have a great idea, so I think I'd leave > it with a local helper in the new test? You could use the interactive_psql infrastructure that already exists for psql/t/010_tab_completion.pl. That does rely on IO::Pty, but I think I'd prefer to accept that dependency for such tests over rolling our own IPC::Run, which is more or less what you've done here. regards, tom lane
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-09 17:02:58 +0900, Ian Barwick wrote: > Attached, though bear in mind I'm not very familiar with parts of this, > particularly 2PC stuff, so consider it educated guesswork. Thanks for this, and the test case! Your commit fixes the issues, but not quite correctly. Multixacts shouldn't matter, so we don't need to do anything there. And for the increases, I think they should be inside the already existing ProcArrayLock acquisition, as in the attached. I've also included a quite heavily revised version of your test. Instead of using dblink I went for having a long-running psql that I feed over stdin. The main reason for not liking the previous version is that it seems fragile, with the sleep and everything. I expanded it to cover 2PC as well. The test probably needs a bit of cleanup, wrapping some of the redundancy around the pump_until calls. I think the approach of having a long running psql session is really useful, and probably would speed up some tests. Does anybody have a good idea for how to best, and without undue effort, to integrate this into PostgresNode.pm? I don't really have a great idea, so I think I'd leave it with a local helper in the new test? Regards, Andres

From a637b65fc53b208857e0d3d17141d8ed3609036f Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 14 Sep 2020 16:08:25 -0700
Subject: [PATCH v2] WIP: fix and test snapshot behaviour on standby.
Reported-By: Ian Barwick
Author: Andres Freund
Author: Ian Barwick
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb...@2ndquadrant.com
---
 src/backend/storage/ipc/procarray.c       |   3 +
 src/test/recovery/t/021_row_visibility.pl | 227 ++
 2 files changed, 230 insertions(+)
 create mode 100644 src/test/recovery/t/021_row_visibility.pl

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 802b119c490..fffa5f7a93e 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -4280,6 +4280,9 @@ ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids,
 	/* As in ProcArrayEndTransaction, advance latestCompletedXid */
 	MaintainLatestCompletedXidRecovery(max_xid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 000..08713fa2686
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,227 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_primary->start;
+
+# Initialize with empty test table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.test_visibility (data text not null)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', 'max_prepared_transactions=10');
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(5);
+
+
+# One psql to primary for all queries. That allows to check
+# uncommitted changes being replicated and such.
+my ($psql_primary_stdin, $psql_primary_stdout, $psql_primary_stderr) = ('', '', '');
+my $psql_primary = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_primary->connstr('postgres')
+	],
+	'<',
+	\$psql_primary_stdin,
+	'>',
+	\$psql_primary_stdout,
+	'2>',
+	\$psql_primary_stderr,
+	$psql_timeout);
+
+# One psql to standby for all queries. That allows to reuse the same
+# session for multiple queries, which is important to detect some
+# types of errors.
+my ($psql_standby_stdin, $psql_standby_stdout, $psql_standby_stderr) = ('', '', '');
+my $psql_standby = IPC::Run::start(
+	[
+		'psql', '-X', '-qAe', '-f', '-', '-d',
+		$node_standby->connstr('postgres')
+	],
+	'<',
+	\$psql_standby_stdin,
+	'>',
+	\$psql_standby_stdout,
+	'2>',
+	\$psql_standby_stderr,
+	$psql_timeout);
+
+#
+# 1. Check initial data is the same
+#
+$psql_standby_stdin .= q[
+SELECT * FROM test_visibility ORDER BY data;
+	];
+ok(pump_until($psql_standby, \$psql_standby_stdout, qr/0 rows/m),
+	'data not visible');
+$psql_standby_stdout = '';
+$psql_standby_stderr = '';
+
+
+#
+# 2. Check if an INSERT is replayed and visible
+#
Re: Improving connection scalability: GetSnapshotData()
On 2020/09/08 13:23, Ian Barwick wrote: On 2020/09/08 13:11, Andres Freund wrote: Hi, On 2020-09-08 13:03:01 +0900, Ian Barwick wrote: (...) I wonder if it's possible to increment "xactCompletionCount" during replay along these lines:

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** xact_redo_commit(xl_xact_parsed_commit *
*** 5915,5920
--- 5915,5924
      */
     if (XactCompletionApplyFeedback(parsed->xinfo))
         XLogRequestWalReceiverReply();
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+    ShmemVariableCache->xactCompletionCount++;
+    LWLockRelease(ProcArrayLock);
 }

which seems to work (though quite possibly I've overlooked something I don't know that I don't know about and it will all break horribly somewhere, etc. etc.). We'd also need the same in a few more places. Probably worth looking at the list where we increment it on the primary (particularly we need to also increment it for aborts, and 2pc commit/aborts). Yup. At first I was very confused as to why none of the existing tests have found this significant issue. But after thinking about it for a minute that's because they all use psql, and largely separate psql invocations for each query :(. Which means that there's no cached snapshot around... Do you want to try to write a patch? Sure, I'll give it a go as I have some time right now. Attached, though bear in mind I'm not very familiar with parts of this, particularly 2PC stuff, so consider it educated guesswork.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

commit 544e2b1661413fe08e3083f03063c12c0d7cf3aa
Author: Ian Barwick
Date:   Tue Sep 8 12:24:14 2020 +0900

    Fix snapshot caching on standbys

    Addresses issue introduced in 623a9ba.
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..227d03bbce 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3320,6 +3320,14 @@ multixact_redo(XLogReaderState *record)
 	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 Datum
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..04ca858918 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5915,6 +5915,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	 */
 	if (XactCompletionApplyFeedback(parsed->xinfo))
 		XLogRequestWalReceiverReply();
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -5978,6 +5986,14 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 void
Re: Improving connection scalability: GetSnapshotData()
On 2020/09/09 2:53, Andres Freund wrote: Hi, On 2020-09-08 16:44:17 +1200, Thomas Munro wrote: On Tue, Sep 8, 2020 at 4:11 PM Andres Freund wrote: At first I was very confused as to why none of the existing tests have found this significant issue. But after thinking about it for a minute that's because they all use psql, and largely separate psql invocations for each query :(. Which means that there's no cached snapshot around... I prototyped a TAP test patch that could maybe do the sort of thing you need, in patch 0006 over at [1]. Later versions of that patch set dropped it, because I figured out how to use the isolation tester instead, but I guess you can't do that for a standby test (at least not until someone teaches the isolation tester to support multi-node schedules, something that would be extremely useful...). Unfortunately a proper multi-node isolationtester test is basically equivalent to building a global lock graph. I think, at least? Including a need to be able to correlate connections with their locks between the nodes. But for something like the bug at hand it'd probably be sufficient to just "hack" something with dblink. In session 1) insert a row on the primary using dblink, return the LSN, wait for the LSN to have replicated and finally in session 2) check for row visibility. The attached seems to do the trick.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

commit b31d587d71f75115b02dd1bf6230a56722c67832
Author: Ian Barwick
Date:   Wed Sep 9 14:37:40 2020 +0900

    test for standby row visibility

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index fa8e031526..2d9a9701fc 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/dblink
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/021_row_visibility.pl b/src/test/recovery/t/021_row_visibility.pl
new file mode 100644
index 00..5f591d131e
--- /dev/null
+++ b/src/test/recovery/t/021_row_visibility.pl
@@ -0,0 +1,84 @@
+# Checks that a standby session can see all expected rows
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+$node_primary->safe_psql('postgres',
+	'CREATE EXTENSION dblink');
+
+# Add an arbitrary table
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE public.foo (id INT)');
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby from backup
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+sleep(5);
+
+# Check row visibility in an existing standby session
+my ($res, $stdout, $stderr) = $node_standby->psql(
+	'postgres',
+	sprintf(
+		<<'EO_SQL',
+DO $$
+  DECLARE
+    primary_lsn pg_lsn;
+    insert_xmin xid;
+    standby_rec RECORD;
+  BEGIN
+    SELECT INTO primary_lsn, insert_xmin
+           t1.primary_lsn, t1.xmin
+      FROM dblink(
+             'host=%s port=%i dbname=postgres',
+             'INSERT INTO public.foo VALUES (1) RETURNING pg_catalog.pg_current_wal_lsn(), xmin'
+           ) AS t1(primary_lsn pg_lsn, xmin xid);
+
+    LOOP
+      EXIT WHEN pg_catalog.pg_last_wal_replay_lsn() > primary_lsn;
+    END LOOP;
+
+    SELECT INTO standby_rec
+           id
+      FROM public.foo
+     WHERE id = 1 AND xmin = insert_xmin;
+
+    IF FOUND
+      THEN RAISE NOTICE 'row found';
+      ELSE RAISE NOTICE 'row not found';
+    END IF;
+  END;
+$$;
+EO_SQL
+		$node_primary->host,
+		$node_primary->port,
+	),
+);
+
+like (
+	$stderr,
+	qr/row found/,
+	'check that inserted row is visible on the standby',
+);
+
+$node_primary->stop;
+$node_standby->stop;
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-06-07 11:24:50 +0300, Michail Nikolaev wrote:
> Hello, hackers.
> Andres, nice work!
>
> Sorry for the off-topic question.
>
> Some of my work [1] related to the support of index hint bits on
> standby is highly interfering with this patch.
> Is it safe to consider it committed and start rebasing on top of the patches?

Sorry, I missed this email. Since they're now committed, yes, it is safe ;)

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-08 16:44:17 +1200, Thomas Munro wrote:
> On Tue, Sep 8, 2020 at 4:11 PM Andres Freund wrote:
> > At first I was very confused as to why none of the existing tests have
> > found this significant issue. But after thinking about it for a minute
> > that's because they all use psql, and largely separate psql invocations
> > for each query :(. Which means that there's no cached snapshot around...
>
> I prototyped a TAP test patch that could maybe do the sort of thing
> you need, in patch 0006 over at [1]. Later versions of that patch set
> dropped it, because I figured out how to use the isolation tester
> instead, but I guess you can't do that for a standby test (at least
> not until someone teaches the isolation tester to support multi-node
> schedules, something that would be extremely useful...).

Unfortunately a proper multi-node isolationtester test is basically equivalent to building a global lock graph. I think, at least? Including a need to be able to correlate connections with their locks between the nodes. But for something like the bug at hand it'd probably be sufficient to just "hack" something with dblink. In session 1) insert a row on the primary using dblink, return the LSN, wait for the LSN to have replicated and finally in session 2) check for row visibility.

Greetings, Andres Freund
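The dblink recipe above — insert on the primary, capture the LSN, poll the standby's replay position, then check visibility in an existing session — can be sketched as a toy model. This is plain illustrative Python of my own, not anything from the PostgreSQL tree; the `Primary`/`Standby` classes are invented for the sketch and correspond only loosely to `pg_current_wal_lsn()` / `pg_last_wal_replay_lsn()`:

```python
# Toy model of the proposed test: session 1 writes on the "primary" and
# records the commit LSN; we then wait until the "standby" has replayed
# past that LSN before session 2 checks row visibility.

class Primary:
    def __init__(self):
        self.wal = []            # list of (lsn, row) records
        self.current_lsn = 0

    def insert(self, row):
        # Append a WAL record and return its LSN, loosely like
        # pg_current_wal_lsn() right after an INSERT.
        self.current_lsn += 1
        self.wal.append((self.current_lsn, row))
        return self.current_lsn

class Standby:
    def __init__(self, primary):
        self.primary = primary
        self.replay_lsn = 0      # loosely like pg_last_wal_replay_lsn()
        self.rows = []

    def replay_one(self):
        # Apply the next not-yet-replayed WAL record, if any.
        pending = [(lsn, row) for lsn, row in self.primary.wal
                   if lsn > self.replay_lsn]
        if pending:
            lsn, row = pending[0]
            self.rows.append(row)
            self.replay_lsn = lsn

primary = Primary()
standby = Standby(primary)

# Session 1: insert on the primary and remember the LSN.
insert_lsn = primary.insert("row1")

# Wait loop: poll until the standby has replayed up to insert_lsn.
while standby.replay_lsn < insert_lsn:
    standby.replay_one()

# Session 2: the row must now be visible on the standby.
print("row1" in standby.rows)    # True
```

The actual test posted in this thread does the same thing in a TAP script, with dblink providing the cross-node call and a PL/pgSQL loop doing the polling.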
Re: Improving connection scalability: GetSnapshotData()
On 07.09.2020 23:45, Andres Freund wrote: Hi, On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote: And which pgbench database scale factor you have used? 200 Another thing you could try is to run 2-4 pgbench instances in different databases. I tried to reinitialize the database with scale 200 but there was no significant improvement in performance. If you're replying to the last bit I am quoting, I was talking about having four databases with separate pgbench tables etc. To see how much of it is procarray contention, and how much it is contention of common buffers etc. Sorry, I have tested the hypothesis that the difference in performance between my and your cases can be explained by the size of the table, which can influence shared buffer contention. That is why I used the same scale as you, but there is no difference compared with scale 100. And definitely Postgres performance in this test is limited by lock contention (most likely shared buffer locks, rather than procarray locks). If I create two instances of Postgres, both with a pgbench -s 200 database, and run two pgbenches with 100 connections each, then each instance shows the same ~1 million TPS (1186483) as when launched standalone. And the total TPS is 2.3 million. Attachments: * pgbench.svg What numactl was used for this one? None. I have not used numactl in this case. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Improving connection scalability: GetSnapshotData()
On Tue, Sep 8, 2020 at 4:11 PM Andres Freund wrote:
> At first I was very confused as to why none of the existing tests have
> found this significant issue. But after thinking about it for a minute
> that's because they all use psql, and largely separate psql invocations
> for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing you need, in patch 0006 over at [1]. Later versions of that patch set dropped it, because I figured out how to use the isolation tester instead, but I guess you can't do that for a standby test (at least not until someone teaches the isolation tester to support multi-node schedules, something that would be extremely useful...). Example:

+# start an interactive session that we can use to interleave statements
+my $session = PsqlSession->new($node, "postgres");
+$session->send("\\set PROMPT1 ''\n", 2);
+$session->send("\\set PROMPT2 ''\n", 1);
...
+# our snapshot is not too old yet, so we can still use it
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "1");
...
+# our snapshot is too old! the thing it wants to see has been removed
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "ERROR: snapshot too old");

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BFkUuDv-bcBns%3DZ_O-V9QGW0nWZNHOkEPxHZWjegRXvw%40mail.gmail.com
Re: Improving connection scalability: GetSnapshotData()
On 2020/09/08 13:11, Andres Freund wrote: Hi, On 2020-09-08 13:03:01 +0900, Ian Barwick wrote: (...) I wonder if it's possible to increment "xactCompletionCount" during replay along these lines:

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** xact_redo_commit(xl_xact_parsed_commit *
*** 5915,5920
--- 5915,5924
      */
     if (XactCompletionApplyFeedback(parsed->xinfo))
         XLogRequestWalReceiverReply();
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+    ShmemVariableCache->xactCompletionCount++;
+    LWLockRelease(ProcArrayLock);
 }

which seems to work (though quite possibly I've overlooked something I don't know that I don't know about and it will all break horribly somewhere, etc. etc.). We'd also need the same in a few more places. Probably worth looking at the list where we increment it on the primary (particularly we need to also increment it for aborts, and 2pc commit/aborts). Yup. At first I was very confused as to why none of the existing tests have found this significant issue. But after thinking about it for a minute that's because they all use psql, and largely separate psql invocations for each query :(. Which means that there's no cached snapshot around... Do you want to try to write a patch? Sure, I'll give it a go as I have some time right now.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:
> On 2020/09/03 17:18, Michael Paquier wrote:
> > On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > > So we get some buildfarm results while thinking about this.
> >
> > Andres, there is an entry in the CF for this thread:
> > https://commitfest.postgresql.org/29/2500/
> >
> > A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
>
> I haven't seen it mentioned here, so apologies if I've overlooked
> something, but as of 623a9ba queries on standbys seem somewhat
> broken.
>
> Specifically, I maintain some code which does something like this:
>
> - connects to a standby
> - checks a particular row does not exist on the standby
> - connects to the primary
> - writes a row in the primary
> - polls the standby (using the same connection as above)
>   to verify the row arrives on the standby
>
> As of recent HEAD it never sees the row arrive on the standby, even
> though it is verifiably there.

Ugh, that's not good.

> I've traced this back to 623a9ba, which relies on "xactCompletionCount"
> being incremented to determine whether the snapshot can be reused,
> but that never happens on a standby, meaning this test in
> GetSnapshotDataReuse():
>
>     if (curXactCompletionCount != snapshot->snapXactCompletionCount)
>         return false;
>
> will never return false, and the snapshot's xmin/xmax never get advanced.
> Which means the session on the standby is not able to see rows on the
> standby added after the session was started.
>
> It's simple enough to prevent that being an issue by just never calling
> GetSnapshotDataReuse() if the snapshot was taken during recovery
> (though obviously that means any performance benefits won't be available
> on standbys).

Yea, that doesn't sound great. Nor is the additional branch welcome.
> I wonder if it's possible to increment "xactCompletionCount"
> during replay along these lines:
>
> *** a/src/backend/access/transam/xact.c
> --- b/src/backend/access/transam/xact.c
> *** xact_redo_commit(xl_xact_parsed_commit *
> *** 5915,5920
> --- 5915,5924
>      */
>     if (XactCompletionApplyFeedback(parsed->xinfo))
>         XLogRequestWalReceiverReply();
> +
> +    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +    ShmemVariableCache->xactCompletionCount++;
> +    LWLockRelease(ProcArrayLock);
>  }
>
> which seems to work (though quite possibly I've overlooked something I don't
> know that I don't know about and it will all break horribly somewhere,
> etc. etc.).

We'd also need the same in a few more places. Probably worth looking at the list where we increment it on the primary (particularly we need to also increment it for aborts, and 2pc commit/aborts).

At first I was very confused as to why none of the existing tests have found this significant issue. But after thinking about it for a minute that's because they all use psql, and largely separate psql invocations for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 2020/09/03 17:18, Michael Paquier wrote: On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote: So we get some buildfarm results while thinking about this. Andres, there is an entry in the CF for this thread: https://commitfest.postgresql.org/29/2500/ A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc. I haven't seen it mentioned here, so apologies if I've overlooked something, but as of 623a9ba queries on standbys seem somewhat broken. Specifically, I maintain some code which does something like this:

- connects to a standby
- checks a particular row does not exist on the standby
- connects to the primary
- writes a row in the primary
- polls the standby (using the same connection as above)
  to verify the row arrives on the standby

As of recent HEAD it never sees the row arrive on the standby, even though it is verifiably there. I've traced this back to 623a9ba, which relies on "xactCompletionCount" being incremented to determine whether the snapshot can be reused, but that never happens on a standby, meaning this test in GetSnapshotDataReuse():

    if (curXactCompletionCount != snapshot->snapXactCompletionCount)
        return false;

will never return false, and the snapshot's xmin/xmax never get advanced. Which means the session on the standby is not able to see rows on the standby added after the session was started. It's simple enough to prevent that being an issue by just never calling GetSnapshotDataReuse() if the snapshot was taken during recovery (though obviously that means any performance benefits won't be available on standbys).
I wonder if it's possible to increment "xactCompletionCount" during replay along these lines:

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** xact_redo_commit(xl_xact_parsed_commit *
*** 5915,5920
--- 5915,5924
      */
     if (XactCompletionApplyFeedback(parsed->xinfo))
         XLogRequestWalReceiverReply();
+
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+    ShmemVariableCache->xactCompletionCount++;
+    LWLockRelease(ProcArrayLock);
 }

which seems to work (though quite possibly I've overlooked something I don't know that I don't know about and it will all break horribly somewhere, etc. etc.).

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
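The failure mode discussed in this exchange can be reduced to a small model — illustrative Python of my own, not PostgreSQL code (the class and function names are invented for the sketch): a session keeps its cached snapshot for as long as the completion counter is unchanged, so if replay never advances the counter, replayed commits stay invisible to an existing session:

```python
# Toy model of snapshot caching keyed on xactCompletionCount: the
# "reuse" path returns the cached snapshot whenever the counter is
# unchanged, mirroring the GetSnapshotDataReuse() check quoted in
# the bug report.

class Server:
    def __init__(self):
        self.xact_completion_count = 0   # models ShmemVariableCache->xactCompletionCount
        self.committed_rows = []         # rows a *fresh* snapshot would see

    def replay_commit(self, row, bump_counter):
        # Apply a commit record; bump_counter=True models the fix of
        # advancing the counter during redo.
        self.committed_rows.append(row)
        if bump_counter:
            self.xact_completion_count += 1

class Session:
    def __init__(self, server):
        self.server = server
        self.cached_snapshot = None
        self.snap_count = None

    def get_snapshot(self):
        # Reuse path: counter unchanged -> keep the cached snapshot.
        if (self.cached_snapshot is not None
                and self.snap_count == self.server.xact_completion_count):
            return self.cached_snapshot
        # Rebuild path: take a fresh view of committed rows.
        self.cached_snapshot = list(self.server.committed_rows)
        self.snap_count = self.server.xact_completion_count
        return self.cached_snapshot

def rows_seen_after_replay(bump_counter):
    standby = Server()
    session = Session(standby)
    session.get_snapshot()               # session caches an empty snapshot
    standby.replay_commit("row1", bump_counter)
    return session.get_snapshot()

print(rows_seen_after_replay(bump_counter=False))  # [] - the reported bug
print(rows_seen_after_replay(bump_counter=True))   # ['row1']
```

In the model, as in the report, the session is long-lived; one-shot psql invocations always take the rebuild path, which is why the existing psql-based tests never caught the problem.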
Re: Improving connection scalability: GetSnapshotData()
Hi, On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:
> >> And which pgbench database scale factor you have used?
> > 200
> >
> > Another thing you could try is to run 2-4 pgbench instances in different
> > databases.
> I tried to reinitialize the database with scale 200 but there was no
> significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having four databases with separate pgbench tables etc. To see how much of it is procarray contention, and how much it is contention of common buffers etc.

> Attachments:
> * pgbench.svg

What numactl was used for this one?
Re: Improving connection scalability: GetSnapshotData()
On 06.09.2020 21:52, Andres Freund wrote: Hi, On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote: On 04.09.2020 21:53, Andres Freund wrote: I also used huge_pages=on / configured them on the OS level. Otherwise TLB misses will be a significant factor. As far as I understand there should not be any TLB misses because the size of the shared buffers (8Mb) is several orders of magnitude smaller than the available physical memory. I assume you didn't mean 8MB but 8GB? If so, that's way large enough to be bigger than the TLB, particularly across processes (IIRC there's no optimization to keep shared mappings de-duplicated between processes from the view of the TLB). Sorry, certainly 8Gb. I tried huge pages, but it has almost no effect.
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-06 14:05:35 +0300, Konstantin Knizhnik wrote:
> On 04.09.2020 21:53, Andres Freund wrote:
> >
> > May be it is because of more complex architecture of my server?
> > Think we'll need profiles to know...
>
> This is "perf top" of pgbench -c 100 -j 100 -M prepared -S
>
> 12.16% postgres [.] PinBuffer
> 11.92% postgres [.] LWLockAttemptLock
> 6.46% postgres [.] UnpinBuffer.constprop.11
> 6.03% postgres [.] LWLockRelease
> 3.14% postgres [.] BufferGetBlockNumber
> 3.04% postgres [.] ReadBuffer_common
> 2.13% [kernel] [k] _raw_spin_lock_irqsave
> 2.11% [kernel] [k] switch_mm_irqs_off
> 1.95% postgres [.] _bt_compare
>
> Looks like most of the time is spent in buffer locks.

Hm, that is interesting / odd. If you record a profile with call graphs (e.g. --call-graph dwarf), where are all the LWLockAttemptLock calls coming from?

I assume the machine you're talking about is an 8 socket machine? What if you:

a) start postgres and pgbench with numactl --interleave=all
b) start postgres with numactl --interleave=0,1 --cpunodebind=0,1 --membind=0,1 in case you have 4 sockets, or 0,1,2,3 in case you have 8 sockets?

> And which pgbench database scale factor you have used?

200

Another thing you could try is to run 2-4 pgbench instances in different databases.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:
> On 04.09.2020 21:53, Andres Freund wrote:
> >
> > I also used huge_pages=on / configured them on the OS level. Otherwise
> > TLB misses will be a significant factor.
>
> As far as I understand there should not be any TLB misses because the size of
> the shared buffers (8Mb) is several orders of magnitude smaller than the
> available physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to be bigger than the TLB, particularly across processes (IIRC there's no optimization to keep shared mappings de-duplicated between processes from the view of the TLB).

> Yes, there is also a noticeable difference in my case
>
> | Idle Connections | Active Connections | TPS pre | TPS post |
> |-----------------:|-------------------:|--------:|---------:|
> |             5000 |                 48 |  758914 |  1184085 |

Sounds like you're somehow hitting another bottleneck around 1.2M TPS.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 04.09.2020 21:53, Andres Freund wrote: May be it is because of more complex architecture of my server? Think we'll need profiles to know... This is "perf top" of pgbench -c 100 -j 100 -M prepared -S

12.16% postgres [.] PinBuffer
11.92% postgres [.] LWLockAttemptLock
 6.46% postgres [.] UnpinBuffer.constprop.11
 6.03% postgres [.] LWLockRelease
 3.14% postgres [.] BufferGetBlockNumber
 3.04% postgres [.] ReadBuffer_common
 2.13% [kernel]  [k] _raw_spin_lock_irqsave
 2.11% [kernel]  [k] switch_mm_irqs_off
 1.95% postgres [.] _bt_compare

Looks like most of the time is spent in buffer locks. And which pgbench database scale factor you have used?
Re: Improving connection scalability: GetSnapshotData()
On 04.09.2020 21:53, Andres Freund wrote: I also used huge_pages=on / configured them on the OS level. Otherwise TLB misses will be a significant factor. As far as I understand there should not be any TLB misses because the size of the shared buffers (8Mb) is several orders of magnitude smaller than the available physical memory. Does it change if you initialize the test database using PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100 or run a manual VACUUM FREEZE; after initialization? I tried it, but didn't see any improvement. Hm, it'd probably be good to compare commits closer to the changes, to avoid other changes showing up. Hm - did you verify if all the connections were actually established? Particularly without the patch applied? With an unmodified pgbench, I sometimes saw better numbers, but only because only half the connections were able to be established, due to ProcArrayLock contention. Yes, that really happens quite often on the IBM Power2 server (a specific of its atomic implementation). I even had to patch pgbench by adding a one second delay after a connection has been established, to make it possible for all clients to connect. But at the Intel server I didn't see unconnected clients. And in any case - it happens only for a large number of connections (> 1000). But the best performance was achieved at about 100 connections, and still I cannot reach the 2M TPS performance as in your case. Did you connect via tcp or unix socket? Was pgbench running on the same machine? It was locally via unix socket for me (but it's also observable via two machines, just with lower overall throughput). Pgbench was launched on the same machine and connected through unix sockets. Did you run a profile to see where the bottleneck is? Sorry, I do not have root privileges at this server and so cannot use perf. There's a separate benchmark that I found to be quite revealing that's far less dependent on scheduler behaviour.
Run two pgbench instances:

1) With a very simple script '\sleep 1s' or such, and many connections (e.g. 100, 1000, 5000). That's to simulate connections that are currently idle.
2) With a normal pgbench read only script, and low client counts.

Before the changes 2) shows a very sharp decline in performance when the count in 1) increases. Afterwards it's pretty much linear. I think this benchmark actually is much more real world oriented - due to latency and client side overheads it's very normal to have a large fraction of connections idle in read mostly OLTP workloads. Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing 1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against 07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post connection scalability changes. I used fairly short pgbench runs (15s), and the numbers are the best of three runs. I also had emacs and mutt open - some noise to be expected. But I also gotta work ;)

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
|                0 |                  1 |   33599 |    33406 |
|              100 |                  1 |   31088 |    33279 |
|             1000 |                  1 |   29377 |    33434 |
|             2500 |                  1 |   27050 |    33149 |
|             5000 |                  1 |   21895 |    33903 |
|            10000 |                  1 |   16034 |    33140 |
|                0 |                 48 | 1042005 |  1125104 |
|              100 |                 48 |  986731 |  1103584 |
|             1000 |                 48 |  854230 |  1119043 |
|             2500 |                 48 |  716624 |  1119353 |
|             5000 |                 48 |  553657 |  1119476 |
|            10000 |                 48 |  369845 |  1115740 |

Yes, there is also a noticeable difference in my case

| Idle Connections | Active Connections | TPS pre | TPS post |
|-----------------:|-------------------:|--------:|---------:|
|             5000 |                 48 |  758914 |  1184085 |

Think we'll need profiles to know... I will try to obtain sudo permissions and do profiling.
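One way to see why the idle connections in 1) hurt the pre-patch numbers so sharply: before the rework, GetSnapshotData() had to inspect one shared-memory slot per connection, idle or not, whereas afterwards it scans a dense array with roughly one entry per in-progress transaction. The cost model below is a deliberate oversimplification of my own (it ignores locking, caching, and scheduler effects entirely), just to illustrate the scaling shape:

```python
# Rough per-snapshot work, measured in "slots inspected". Only the old
# scheme pays a cost per idle connection; the new scheme's cost depends
# on the number of in-progress transactions.

def old_snapshot_cost(idle_conns, active_xacts):
    # One PGPROC/PGXACT-style slot examined per connection, idle or not.
    return idle_conns + active_xacts

def new_snapshot_cost(idle_conns, active_xacts):
    # Only the dense in-progress transaction array is scanned.
    return active_xacts

# Same idle-connection counts as the benchmark, with 48 active clients.
for idle in (0, 100, 1000, 2500, 5000, 10000):
    print(idle, old_snapshot_cost(idle, 48), new_snapshot_cost(idle, 48))
```

This matches the shape of the measurements above: "TPS pre" falls steadily as idle connections grow, while "TPS post" stays roughly flat.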
Re: Improving connection scalability: GetSnapshotData()
On Fri, Sep 04, 2020 at 10:35:19AM -0700, Andres Freund wrote:
> I think it's best to close the entry. There are substantial further wins
> possible, in particular not acquiring ProcArrayLock in GetSnapshotData()
> when the cache is valid improves performance substantially. But it's
> non-trivial enough that it's probably best dealt with in a separate
> patch / CF entry.

Cool, thanks for updating the CF entry.

--
Michael
Re: Improving connection scalability: GetSnapshotData()
On 2020-09-04 11:53:04 -0700, Andres Freund wrote:
> There's a separate benchmark that I found to be quite revealing that's
> far less dependent on scheduler behaviour. Run two pgbench instances:
>
> 1) With a very simple script '\sleep 1s' or such, and many connections
>    (e.g. 100, 1000, 5000). That's to simulate connections that are
>    currently idle.
> 2) With a normal pgbench read only script, and low client counts.
>
> Before the changes 2) shows a very sharp decline in performance when the
> count in 1) increases. Afterwards its pretty much linear.
>
> I think this benchmark actually is much more real world oriented - due
> to latency and client side overheads it's very normal to have a large
> fraction of connections idle in read mostly OLTP workloads.
>
> Here's the result on my workstation (2x Xeon Gold 5215 CPUs), testing
> 1f42d35a1d6144a23602b2c0bc7f97f3046cf890 against
> 07f32fcd23ac81898ed47f88beb569c631a2f223 which are the commits pre/post
> connection scalability changes.
>
> I used fairly short pgbench runs (15s), and the numbers are the best of
> three runs. I also had emacs and mutt open - some noise to be
> expected. But I also gotta work ;)
>
> | Idle Connections | Active Connections | TPS pre | TPS post |
> |-----------------:|-------------------:|--------:|---------:|
> |                0 |                  1 |   33599 |    33406 |
> |              100 |                  1 |   31088 |    33279 |
> |             1000 |                  1 |   29377 |    33434 |
> |             2500 |                  1 |   27050 |    33149 |
> |             5000 |                  1 |   21895 |    33903 |
> |            10000 |                  1 |   16034 |    33140 |
> |                0 |                 48 | 1042005 |  1125104 |
> |              100 |                 48 |  986731 |  1103584 |
> |             1000 |                 48 |  854230 |  1119043 |
> |             2500 |                 48 |  716624 |  1119353 |
> |             5000 |                 48 |  553657 |  1119476 |
> |            10000 |                 48 |  369845 |  1115740 |

Attached in graph form.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-09-04 18:24:12 +0300, Konstantin Knizhnik wrote:
> The reported results look very impressive.
> But I tried to reproduce them and didn't observe similar behavior.
> So I am wondering what the difference can be and what I am doing wrong.

That is odd - I did reproduce it on quite a few systems by now.

> Configuration file has the following differences with default postgres config:
>
> max_connections = 10000 # (change requires restart)
> shared_buffers = 8GB # min 128kB

I also used huge_pages=on / configured them on the OS level. Otherwise TLB misses will be a significant factor. Does it change if you initialize the test database using PGOPTIONS='-c vacuum_freeze_min_age=0' pgbench -i -s 100 or run a manual VACUUM FREEZE; after initialization?

> I have tried two different systems.
> First one is an IBM Power2 server with 384 cores and 8Tb of RAM.
> I ran the same read-only pgbench test as you. I do not think that the size of the
> database matters, so I used scale 100 -
> it seems to be enough to avoid frequent buffer conflicts.
> Then I ran the same scripts as you:
>
> for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
> for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
>
> I have compared current master with the version of Postgres prior to your commits
> with scalability improvements: a9a4a7ad56

Hm, it'd probably be good to compare commits closer to the changes, to avoid other changes showing up. Hm - did you verify if all the connections were actually established? Particularly without the patch applied? With an unmodified pgbench, I sometimes saw better numbers, but only because only half the connections were able to be established, due to ProcArrayLock contention.
See https://www.postgresql.org/message-id/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de

There also is the issue that pgbench numbers for inclusive/exclusive are just about meaningless right now: https://www.postgresql.org/message-id/20200227202636.qaf7o6qcajsudoor%40alap3.anarazel.de (reminds me, need to get that fixed)

One more thing worth investigating is whether your results change significantly when you start the server using numactl --interleave=all . Especially on larger systems the results otherwise can vary a lot from run-to-run, because the placement of shared buffers matters a lot.

> So I have repeated the experiments at an Intel server.
> It has 160 cores Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and 256Gb of RAM.
>
> The same database, the same script, results are the following:
>
> Clients   old/inc   old/exl   new/inc   new/exl
> 1000      1105750   1163292   1206105   1212701
> 2000      1050933   1124688   1149706   1164942
> 3000      1063667   1195158   1118087   1144216
> 4000      1040065   1290432   1107348   1163906
> 5000       943813   1258643   1103790   1160251
>
> I have separately shown results including/excluding connection establishment,
> because in the new version there are almost no differences between them,
> but for the old version the gap between them is noticeable.
>
> Configuration file has the following differences with default postgres config:
>
> max_connections = 10000 # (change requires restart)
> shared_buffers = 8GB # min 128kB
>
> These results contradict yours and make me ask the following questions:
> 1. Why is performance in your case almost two times larger (2 million vs 1)?
> The hardware in my case seems to be no worse than yours...
> Maybe there are some other improvements in the version you have tested which
> are not yet committed to master?

No, no uncommitted changes, except for the pgbench stuff mentioned above.
However I found that the kernel version matters a fair bit; it's pretty easy to run into kernel scalability issues in a workload that is this heavily scheduler dependent. Did you connect via tcp or unix socket? Was pgbench running on the same machine? It was locally via unix socket for me (but it's also observable via two machines, just with lower overall throughput). Did you run a profile to see where the bottleneck is? There's a separate benchmark that I found to be quite revealing that's far less dependent on scheduler behaviour. Run two pgbench instances: 1) With a very simple script '\sleep 1s' or such, and many connections (e.g. 100, 1000, 5000). That's to simulate connections that are currently idle. 2) With a normal pgbench read-only script, and low client counts. Before the changes 2) shows a very sharp decline in performance when the count in 1) increases. Afterwards it's pretty much linear. I think this benchmark actually is much more real-world oriented - due to latency and client side overheads it's very normal
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-09-03 17:18:29 +0900, Michael Paquier wrote:
> On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > So we get some builfarm results while thinking about this.
>
> Andres, there is an entry in the CF for this thread:
> https://commitfest.postgresql.org/29/2500/
>
> A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> Now that PGXACT is done, how much work is remaining here?

I think it's best to close the entry. There are substantial further wins possible; in particular, not acquiring ProcArrayLock in GetSnapshotData() when the cache is valid improves performance substantially. But it's non-trivial enough that it's probably best dealt with in a separate patch / CF entry.

Closed.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On 03.09.2020 11:18, Michael Paquier wrote:
> On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > So we get some builfarm results while thinking about this.
>
> Andres, there is an entry in the CF for this thread:
> https://commitfest.postgresql.org/29/2500/
> A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> Now that PGXACT is done, how much work is remaining here?
> --
> Michael

Andres,

First of all, a lot of thanks for this work. Improving Postgres connection scalability is very important. The reported results look very impressive. But I tried to reproduce them and didn't observe similar behavior. So I am wondering what the difference can be and what I am doing wrong.

I have tried two different systems. The first one is an IBM Power2 server with 384 cores and 8Tb of RAM. I ran the same read-only pgbench test as you. I do not think that the size of the database matters, so I used scale 100 - it seems to be enough to avoid frequent buffer conflicts. Then I ran the same scripts as you:

for ((n=100; n < 1000; n+=100)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done
for ((n=1000; n <= 5000; n+=1000)); do echo $n; pgbench -M prepared -c $n -T 100 -j $n -M prepared -S -n postgres ; done

I have compared current master with the version of Postgres prior to your scalability commits: a9a4a7ad56. For all numbers of connections the older version shows slightly better results, for example for 500 clients: 475k TPS vs. 450k TPS for current master.

This is a quite exotic server and I do not currently have access to it. So I have repeated the experiments on an Intel server. It has 160 cores (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz) and 256Gb of RAM.
The same database, the same script; the results are the following:

Clients  old/inc   old/exl   new/inc   new/exl
1000     1105750   1163292   1206105   1212701
2000     1050933   1124688   1149706   1164942
3000     1063667   1195158   1118087   1144216
4000     1040065   1290432   1107348   1163906
5000      943813   1258643   1103790   1160251

I have shown results including/excluding connection establishment separately, because in the new version there are almost no differences between them, but for the old version the gap between them is noticeable.

The configuration file has the following differences with the default postgres config:

max_connections = 1 # (change requires restart)
shared_buffers = 8GB # min 128kB

These results contradict yours and make me ask the following questions:

1. Why in your case is performance almost two times higher (2 million vs 1)? The hardware in my case seems to be at least no worse than yours... Maybe there are some other improvements in the version you have tested which are not yet committed to master?

2. You wrote: "This is on a machine with 2 Intel(R) Xeon(R) Platinum 8168, but virtualized (2 sockets of 18 cores/36 threads)". According to the Intel specification, the Intel® Xeon® Platinum 8168 Processor has 24 cores: https://ark.intel.com/content/www/us/en/ark/products/120504/intel-xeon-platinum-8168-processor-33m-cache-2-70-ghz.html And in your graph we can see an almost linear increase of speed up to 40 connections. But the most suspicious word for me is "virtualized". What is the actual hardware and how is it virtualized?

Do you have any idea why in my case the master version (with your commits) behaves almost the same as the non-patched version?
Below is yet another table showing scalability from 10 to 100 connections, combining your results (first two columns) and my results (last two columns):

Clients  old master  pgxact-split-cache  current master  revision a9a4a7ad56
10        367883      375682              358984          347067
20        748000      810964              668631          630304
30        999231     1288276              920255          848244
40        991672     1573310             1100745          970717
50       1017561     1715762             1193928         1008755
60        993943     1789698             1255629          917788
70        971379     1819477             1277634          873022
80        966276     1842248             1266523          830197
90        901175     1847823             1255260          736550
100       803175     1865795             1241143          736756

Maybe it is because of the more complex architecture of my server?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Improving connection scalability: GetSnapshotData()
On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote: > So we get some builfarm results while thinking about this. Andres, there is an entry in the CF for this thread: https://commitfest.postgresql.org/29/2500/ A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc. Now that PGXACT is done, how much work is remaining here? -- Michael
Re: Improving connection scalability: GetSnapshotData()
On 2020-Aug-16, Peter Geoghegan wrote:
> On Sun, Aug 16, 2020 at 2:11 PM Andres Freund wrote:
> > For the first, one issue is that there's no obviously good candidate for
> > an uninitialized xid. We could use something like FrozenTransactionId,
> > which may never be in the procarray. But it's not exactly pretty.
>
> Maybe it would make sense to mark the fields as inaccessible or
> undefined to Valgrind. That has advantages and disadvantages that are
> obvious.

... and perhaps making Valgrind complain about it is sufficient.

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-08-16 17:28:46 -0400, Tom Lane wrote: > Andres Freund writes: > > For the first, one issue is that there's no obviously good candidate for > > an uninitialized xid. We could use something like FrozenTransactionId, > > which may never be in the procarray. But it's not exactly pretty. > > Huh? What's wrong with using InvalidTransactionId? It's a normal value for a backend when it doesn't have an xid assigned. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Andres Freund writes: > For the first, one issue is that there's no obviously good candidate for > an uninitialized xid. We could use something like FrozenTransactionId, > which may never be in the procarray. But it's not exactly pretty. Huh? What's wrong with using InvalidTransactionId? regards, tom lane
Re: Improving connection scalability: GetSnapshotData()
On Sun, Aug 16, 2020 at 2:11 PM Andres Freund wrote: > For the first, one issue is that there's no obviously good candidate for > an uninitialized xid. We could use something like FrozenTransactionId, > which may never be in the procarray. But it's not exactly pretty. Maybe it would make sense to mark the fields as inaccessible or undefined to Valgrind. That has advantages and disadvantages that are obvious. If that isn't enough, it might not hurt to do this on top of whatever becomes the primary solution. An undefined value has the advantage of "spreading" when the value gets copied around. -- Peter Geoghegan
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-08-16 14:11:46 -0700, Andres Freund wrote:
> On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> > On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > Gna, I think I see the problem. In at least one place I wrongly
> > accessed the 'dense' array of in-progress xids using the 'pgprocno',
> > instead of directly using the [0...procArray->numProcs) index.
> >
> > Working on a fix, together with some improved asserts.
>
> diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
> index 8262abd42e6..96e4a878576 100644
> --- i/src/backend/storage/ipc/procarray.c
> +++ w/src/backend/storage/ipc/procarray.c
> @@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
>  		TransactionId xmin;
>
>  		/* Fetch xid just once - see GetNewTransactionId */
> -		xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
> +		xid = UINT32_ACCESS_ONCE(other_xids[index]);
>  		xmin = UINT32_ACCESS_ONCE(proc->xmin);
>
>  		/*
>
> indeed fixes the issue based on a number of iterations of your modified
> test, and fixes a clear bug.

Pushed that much.

> WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
> initialize currently unused procArray->pgprocnos,
> procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
> declare them as uninitialized using the valgrind helpers.
>
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.
>
> Opinions?

So we get some buildfarm results while thinking about this.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > I now luckily have a rr trace of the problem, so I hope I can narrow it
> > down to the original problem fairly quickly.
>
> Gna, I think I see the problem. In at least one place I wrongly
> accessed the 'dense' array of in-progress xids using the 'pgprocno',
> instead of directly using the [0...procArray->numProcs) index.
>
> Working on a fix, together with some improved asserts.

diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 8262abd42e6..96e4a878576 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		TransactionId xmin;
 
 		/* Fetch xid just once - see GetNewTransactionId */
-		xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
+		xid = UINT32_ACCESS_ONCE(other_xids[index]);
 		xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 		/*

indeed fixes the issue based on a number of iterations of your modified test, and fixes a clear bug.

WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal() initialize currently unused procArray->pgprocnos, procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or declare them as uninitialized using the valgrind helpers.

For the first, one issue is that there's no obviously good candidate for an uninitialized xid. We could use something like FrozenTransactionId, which may never be in the procarray. But it's not exactly pretty.

Opinions?

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-08-16 13:31:53 -0700, Andres Freund wrote: > I now luckily have a rr trace of the problem, so I hope I can narrow it > down to the original problem fairly quickly. Gna, I think I see the problem. In at least one place I wrongly accessed the 'dense' array of in-progress xids using the 'pgprocno', instead of directly using the [0...procArray->numProcs) index. Working on a fix, together with some improved asserts. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-08-16 16:17:23 -0400, Tom Lane wrote:
> I wrote:
> > It seems entirely likely that there's a timing component in this, for
> > instance autovacuum coming along at just the right time.
>
> D'oh. The attached seems to make it 100% reproducible.

Great! Interestingly it didn't work as the first item on the schedule, where I had duplicated it, out of impatience. I guess there might be some need of concurrent autovacuum activity or something like that.

I now luckily have a rr trace of the problem, so I hope I can narrow it down to the original problem fairly quickly.

Thanks, Andres
Re: Improving connection scalability: GetSnapshotData()
I wrote:
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time.

D'oh. The attached seems to make it 100% reproducible.

regards, tom lane

diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec
index 915bf15b92..4100d9fc6f 100644
--- a/src/test/isolation/specs/freeze-the-dead.spec
+++ b/src/test/isolation/specs/freeze-the-dead.spec
@@ -32,6 +32,7 @@ session "s2"
 step "s2_begin"		{ BEGIN; }
 step "s2_key_share"	{ SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; }
 step "s2_commit"	{ COMMIT; }
+step "s2_wait"		{ select pg_sleep(60); }
 step "s2_vacuum"	{ VACUUM FREEZE tab_freeze; }
 
 session "s3"
@@ -49,6 +50,7 @@ permutation "s1_begin" "s2_begin" "s3_begin" # start transactions
 	"s1_update" "s2_key_share" "s3_key_share" # have xmax be a multi with an updater, updater being oldest xid
 	"s1_update" # create additional row version that has multis
 	"s1_commit" "s2_commit" # commit both updater and share locker
+	"s2_wait"
 	"s2_vacuum" # due to bug in freezing logic, we used to *not* prune updated row, and then froze it
 	"s1_selectone" # if hot chain is broken, the row can't be found via index scan
 	"s3_commit" # commit remaining open xact
Re: Improving connection scalability: GetSnapshotData()
On 2020-08-16 14:30:24 -0400, Tom Lane wrote:
> Andres Freund writes:
> > 690 successful runs later, it didn't trigger for me :(. Seems pretty
> > clear that there's another variable than pure chance, otherwise it seems
> > like that number of runs should have hit the issue, given the number of
> > bf hits vs bf runs.
>
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time. It's not too
> surprising that some machines would be more prone to show that than
> others. (Note peripatus is FreeBSD, which we've already learned has
> significantly different kernel scheduler behavior than Linux.)

Yea. Interestingly there was a reproduction on linux since the initial reports you'd dug up: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=butterflyfish=2020-08-15%2019%3A54%3A53 but that's likely a virtualized environment, so I guess the host scheduler behaviour could play a similar role.

I'll run a few iterations with rr's chaos mode too, which tries to randomize scheduling decisions...

I noticed that it's quite hard to actually hit the hot tuple path I mentioned earlier on my machine. It would probably be good to have a test hitting it more reliably. But I'm not immediately seeing how we could force the necessary serialization.

Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Andres Freund writes: > 690 successful runs later, it didn't trigger for me :(. Seems pretty > clear that there's another variable than pure chance, otherwise it seems > like that number of runs should have hit the issue, given the number of > bf hits vs bf runs. It seems entirely likely that there's a timing component in this, for instance autovacuum coming along at just the right time. It's not too surprising that some machines would be more prone to show that than others. (Note peripatus is FreeBSD, which we've already learned has significantly different kernel scheduler behavior than Linux.) > My current plan would is to push a bit of additional instrumentation to > help narrow down the issue. +1 regards, tom lane
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-08-15 09:42:00 -0700, Andres Freund wrote: > On 2020-08-15 11:10:51 -0400, Tom Lane wrote: > > We have two essentially identical buildfarm failures since these patches > > went in: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly=2020-08-15%2011%3A27%3A32 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus=2020-08-15%2003%3A09%3A14 > > > > They're both in the same place in the freeze-the-dead isolation test: > > > TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", > > File: "heapam.c", Line: 6051) > > 0x9613eb at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x52d586 at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x53bc7e at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x6949bb at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x694532 at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x693d1c at > > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > > 0x8324b3 > > ... > > 2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was > > terminated by signal 6: Abort trap > > 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: > > VACUUM FREEZE tab_freeze; > > > > peripatus has successes since this failure, so it's not fully reproducible > > on that machine. I'm suspicious of a timing problem in computing vacuum's > > cutoff_xid. > > Hm, maybe it's something around what I observed in > https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de > > I.e. that somehow we end up with hot pruning and freezing coming to a > different determination, and trying to freeze a hot tuple. > > I'll try to add a few additional asserts here, and burn some cpu tests > trying to trigger the issue. > > I gotta escape the heat in the house for a few hours though (no AC > here), so I'll not look at the results till later this afternoon, unless > it triggers soon. 
690 successful runs later, it didn't trigger for me :(. Seems pretty clear that there's another variable than pure chance, otherwise it seems like that number of runs should have hit the issue, given the number of bf hits vs bf runs. My current plan is to push a bit of additional instrumentation to help narrow down the issue. We can afterwards decide what of that we'd like to keep longer term, and what not. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-08-15 11:10:51 -0400, Tom Lane wrote: > We have two essentially identical buildfarm failures since these patches > went in: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly=2020-08-15%2011%3A27%3A32 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus=2020-08-15%2003%3A09%3A14 > > They're both in the same place in the freeze-the-dead isolation test: > TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", > File: "heapam.c", Line: 6051) > 0x9613eb at > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x52d586 at > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x53bc7e at > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x6949bb at > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x694532 at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x693d1c at > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres > 0x8324b3 > ... > 2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was > terminated by signal 6: Abort trap > 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: > VACUUM FREEZE tab_freeze; > > peripatus has successes since this failure, so it's not fully reproducible > on that machine. I'm suspicious of a timing problem in computing vacuum's > cutoff_xid. Hm, maybe it's something around what I observed in https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de I.e. that somehow we end up with hot pruning and freezing coming to a different determination, and trying to freeze a hot tuple. I'll try to add a few additional asserts here, and burn some cpu tests trying to trigger the issue. I gotta escape the heat in the house for a few hours though (no AC here), so I'll not look at the results till later this afternoon, unless it triggers soon. > (I'm also wondering why the failing check is an Assert rather than a real > test-and-elog. 
Assert doesn't seem like an appropriate way to check for > plausible data corruption cases.) Robert, and to a lesser degree you, gave me quite a bit of grief over converting nearby asserts to elogs. I agree it'd be better if it were an assert, but ... Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
We have two essentially identical buildfarm failures since these patches went in: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly=2020-08-15%2011%3A27%3A32 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus=2020-08-15%2003%3A09%3A14 They're both in the same place in the freeze-the-dead isolation test: TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", File: "heapam.c", Line: 6051) 0x9613eb at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x52d586 at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x53bc7e at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x6949bb at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x694532 at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x693d1c at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres 0x8324b3 ... 2020-08-14 22:16:41.783 CDT [78410:4] LOG: server process (PID 80395) was terminated by signal 6: Abort trap 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze; peripatus has successes since this failure, so it's not fully reproducible on that machine. I'm suspicious of a timing problem in computing vacuum's cutoff_xid. (I'm also wondering why the failing check is an Assert rather than a real test-and-elog. Assert doesn't seem like an appropriate way to check for plausible data corruption cases.) regards, tom lane
Re: Improving connection scalability: GetSnapshotData()
On Wed, Aug 12, 2020 at 12:19 PM Andres Freund wrote: > On 2020-07-29 19:20:04 +1200, Thomas Munro wrote: > > On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro wrote: > > > +static inline FullTransactionId > > > +FullXidViaRelative(FullTransactionId rel, TransactionId xid) > > > > > > I'm struggling to find a better word for this than "relative". > > > > The best I've got is "anchor" xid. It is an xid that is known to > > limit nextFullXid's range while the receiving function runs. > > Thinking about it, I think that relative is a good descriptor. It's just > that 'via' is weird. How about: FullXidRelativeTo? WFM.
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-07-29 19:20:04 +1200, Thomas Munro wrote: > On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro wrote: > > +static inline FullTransactionId > > +FullXidViaRelative(FullTransactionId rel, TransactionId xid) > > > > I'm struggling to find a better word for this than "relative". > > The best I've got is "anchor" xid. It is an xid that is known to > limit nextFullXid's range while the receiving function runs. Thinking about it, I think that relative is a good descriptor. It's just that 'via' is weird. How about: FullXidRelativeTo? Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
> On 24 Jul 2020, at 03:11, Andres Freund wrote: > I've done that in the attached. As this is actively being reviewed but time is running short, I'm moving this to the next CF. cheers ./daniel
Re: Improving connection scalability: GetSnapshotData()
On Wed, Jul 29, 2020 at 6:15 PM Thomas Munro wrote: > +static inline FullTransactionId > +FullXidViaRelative(FullTransactionId rel, TransactionId xid) > > I'm struggling to find a better word for this than "relative". The best I've got is "anchor" xid. It is an xid that is known to limit nextFullXid's range while the receiving function runs.
Re: Improving connection scalability: GetSnapshotData()
On Fri, Jul 24, 2020 at 1:11 PM Andres Freund wrote:
> On 2020-07-15 21:33:06 -0400, Alvaro Herrera wrote:
> > On 2020-Jul-15, Andres Freund wrote:
> > > It could make sense to split the conversion of
> > > VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
> > > into is own commit. Not sure...
> >
> > +1, the commit is large enough and that change can be had in advance.
>
> I've done that in the attached.

+ * pair with the memory barrier below. We do however accept xid to be <=
+ * to next_xid, instead of just <, as xid could be from the procarray,
+ * before we see the updated nextFullXid value.

Tricky. Right, that makes sense. I like the range assertion.

+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)

I'm struggling to find a better word for this than "relative".

+	return FullTransactionIdFromU64(U64FromFullTransactionId(rel)
+									+ (int32) (xid - rel_xid));

I like your branch-free code for this.

> I wonder if somebody has an opinion on renaming latestCompletedXid to
> latestCompletedFullXid. That's the pattern we already had (cf
> nextFullXid), but it also leads to pretty long lines and quite a few
> comment etc changes.
>
> I'm somewhat inclined to remove the "Full" out of the variable, and to
> also do that for nextFullXid. I feel like including it in the variable
> name is basically a poor copy of the (also not great) C type system. If
> we hadn't made FullTransactionId a struct I'd see it differently (and
> thus incompatible with TransactionId), but we have ...

Yeah, I'm OK with dropping the "Full". I've found it rather clumsy too.
Re: Improving connection scalability: GetSnapshotData()
On Fri, Jul 24, 2020 at 9:00 PM Andres Freund wrote:
> On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> > On Fri, Jul 24, 2020 at 2:16 PM Andres Freund wrote:
> > > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > > Latest Postgres
> > > > Windows 64 bits
> > > > msvc 2019 64 bits
> > > >
> > > > Patches applied v12-0001 to v12-0007:
> > > >
> > > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
> > > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
> > > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
> > >
> > > I don't know what that's about - there's no call to GetOldestXmin() in pgstatapprox and pg_visibility after patch 0002? And similarly, the PROCARRAY_* references are also removed in the same patch?
> >
> > Maybe we need to remove them from these places, no?
> > C:\dll\postgres\contrib>grep -d GetOldestXmin *.c > > File pgstattuple\pgstatapprox.c: > > OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM); > > File pg_visibility\pg_visibility.c: > > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM); > > * deadlocks, because surely > > GetOldestXmin() should never take > > RecomputedOldestXmin = > GetOldestXmin(NULL, > > PROCARRAY_FLAGS_VACUUM); > > The 0002 patch changed those files: > > diff --git a/contrib/pg_visibility/pg_visibility.c > b/contrib/pg_visibility/pg_visibility.c > index 68d580ed1e0..37206c50a21 100644 > --- a/contrib/pg_visibility/pg_visibility.c > +++ b/contrib/pg_visibility/pg_visibility.c > @@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, > bool all_frozen) > BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); > TransactionId OldestXmin = InvalidTransactionId; > > - if (all_visible) > - { > - /* Don't pass rel; that will fail in recovery. */ > - OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM); > - } > - > rel = relation_open(relid, AccessShareLock); > > /* Only some relkinds have a visibility map */ > check_relation_relkind(rel); > > + if (all_visible) > + OldestXmin = GetOldestNonRemovableTransactionId(rel); > + > nblocks = RelationGetNumberOfBlocks(rel); > > /* > @@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, > bool all_frozen) > * From a concurrency point of view, it > sort of sucks to > * retake ProcArrayLock here while we're > holding the buffer > * exclusively locked, but it should be > safe against > -* deadlocks, because surely > GetOldestXmin() should never take > -* a buffer lock. And this shouldn't > happen often, so it's > -* worth being careful so as to avoid > false positives. > +* deadlocks, because surely > GetOldestNonRemovableTransactionId() > +* should never take a buffer lock. And > this shouldn't happen > +* often, so it's worth being careful so > as to avoid false > +* positives. 
> */ > - RecomputedOldestXmin = GetOldestXmin(NULL, > PROCARRAY_FLAGS_VACUUM); > + RecomputedOldestXmin = > GetOldestNonRemovableTransactionId(rel); > > if (!TransactionIdPrecedes(OldestXmin, > RecomputedOldestXmin)) > record_corrupt_item(items, > _self); > > diff --git a/contrib/pgstattuple/pgstatapprox.c > b/contrib/pgstattuple/pgstatapprox.c > index dbc0fa11f61..3a99333d443 100644 > --- a/contrib/pgstattuple/pgstatapprox.c > +++ b/contrib/pgstattuple/pgstatapprox.c > @@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat) > BufferAccessStrategy bstrategy; > TransactionId OldestXmin; > > - OldestXmin = GetOldestXmin(rel,
Re: Improving connection scalability: GetSnapshotData()
On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> On Fri, Jul 24, 2020 at 2:16 PM Andres Freund wrote:
> > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > Latest Postgres
> > > Windows 64 bits
> > > msvc 2019 64 bits
> > >
> > > Patches applied v12-0001 to v12-0007:
> > >
> > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
> > > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
> >
> > I don't know what that's about - there's no call to GetOldestXmin() in pgstatapprox and pg_visibility after patch 0002? And similarly, the PROCARRAY_* references are also removed in the same patch?
>
> Maybe we need to remove them from these places, no?
> C:\dll\postgres\contrib>grep -d GetOldestXmin *.c > File pgstattuple\pgstatapprox.c: > OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM); > File pg_visibility\pg_visibility.c: > OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM); > * deadlocks, because surely > GetOldestXmin() should never take > RecomputedOldestXmin = GetOldestXmin(NULL, > PROCARRAY_FLAGS_VACUUM); The 0002 patch changed those files: diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c index 68d580ed1e0..37206c50a21 100644 --- a/contrib/pg_visibility/pg_visibility.c +++ b/contrib/pg_visibility/pg_visibility.c @@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD); TransactionId OldestXmin = InvalidTransactionId; - if (all_visible) - { - /* Don't pass rel; that will fail in recovery. */ - OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM); - } - rel = relation_open(relid, AccessShareLock); /* Only some relkinds have a visibility map */ check_relation_relkind(rel); + if (all_visible) + OldestXmin = GetOldestNonRemovableTransactionId(rel); + nblocks = RelationGetNumberOfBlocks(rel); /* @@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen) * From a concurrency point of view, it sort of sucks to * retake ProcArrayLock here while we're holding the buffer * exclusively locked, but it should be safe against -* deadlocks, because surely GetOldestXmin() should never take -* a buffer lock. And this shouldn't happen often, so it's -* worth being careful so as to avoid false positives. +* deadlocks, because surely GetOldestNonRemovableTransactionId() +* should never take a buffer lock. And this shouldn't happen +* often, so it's worth being careful so as to avoid false +* positives. 
*/ - RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM); + RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel); if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin)) record_corrupt_item(items, _self); diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c index dbc0fa11f61..3a99333d443 100644 --- a/contrib/pgstattuple/pgstatapprox.c +++ b/contrib/pgstattuple/pgstatapprox.c @@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat) BufferAccessStrategy bstrategy; TransactionId OldestXmin; - OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM); + OldestXmin = GetOldestNonRemovableTransactionId(rel); bstrategy = GetAccessStrategy(BAS_BULKREAD); nblocks = RelationGetNumberOfBlocks(rel); Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On Fri, Jul 24, 2020 at 14:16 Andres Freund wrote:
> On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > Latest Postgres
> > Windows 64 bits
> > msvc 2019 64 bits
> >
> > Patches applied v12-0001 to v12-0007:
> >
> > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> > 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > C4013: 'GetOldestXmin' undefined; assuming extern returning int
> > [C:\dll\postgres\pg_visibility.vcxproj]
> > C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pgstattuple.vcxproj]
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pg_visibility.vcxproj]
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pg_visibility.vcxproj]
>
> I don't know what that's about - there's no call to GetOldestXmin() in
> pgstatapprox and pg_visibility after patch 0002? And similarly, the
> PROCARRAY_* references are also removed in the same patch?

Maybe they need to be removed from these places, no?

C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
* deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

regards,
Ranier Vilela
Re: Improving connection scalability: GetSnapshotData()
On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> Latest Postgres
> Windows 64 bits
> msvc 2019 64 bits
>
> Patches applied v12-0001 to v12-0007:
>
> C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
> C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> C4013: 'GetOldestXmin' undefined; assuming extern returning int
> [C:\dll\postgres\pg_visibility.vcxproj]
> C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pgstattuple.vcxproj]
> C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pg_visibility.vcxproj]
> C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pg_visibility.vcxproj]

I don't know what that's about - there's no call to GetOldestXmin() in pgstatapprox and pg_visibility after patch 0002? And similarly, the PROCARRAY_* references are also removed in the same patch?

Greetings,

Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning C4013: 'GetOldestXmin' undefined; assuming extern returning int [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pgstattuple.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier [C:\dll\postgres\pg_visibility.vcxproj]

regards,
Ranier Vilela
Re: Improving connection scalability: GetSnapshotData()
On 2020-Jul-15, Andres Freund wrote:
> It could make sense to split the conversion of
> VariableCacheData->latestCompletedXid to FullTransactionId out from 0001
> into its own commit. Not sure...

+1, the commit is large enough and that change can be had in advance.

Note you forward-declare struct GlobalVisState twice in heapam.h.

--
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Improving connection scalability: GetSnapshotData()
This patch no longer applies to HEAD; please submit a rebased version. I've marked it Waiting on Author in the meantime.

cheers ./daniel
Re: Improving connection scalability: GetSnapshotData()
Hello, hackers.

Andres, nice work!

Sorry for going off-topic. Some of my work [1] related to the support of index hint bits on standby is highly interfering with this patch. Is it safe to consider it committed and start rebasing on top of the patches?

Thanks,
Michail.

[1]: https://www.postgresql.org/message-id/CANtu0ojmkN_6P7CQWsZ%3DuEgeFnSmpCiqCxyYaHnhYpTZHj7Ubw%40mail.gmail.com
Re: Improving connection scalability: GetSnapshotData()
On Wed, Apr 08, 2020 at 03:17:41PM -0700, Andres Freund wrote:
> On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:
>> Lastly, with the ongoing world events, perhaps time that could have been
>> dedicated to this and other patches likely affected their completion. I
>> know most things in my life take way longer than they used to (e.g.
>> taking out the trash/recycles has gone from a 15s to 240s routine). The
>> same could be said about other patches as well, but this one has a far
>> greater impact (a double-edged sword, of course) given it's a feature
>> that everyone uses in PostgreSQL ;)
>
> I'm obviously not alone in that, so I agree that it's not an argument
> pro/con anything.
>
> But this definitely is the case for me. Leaving aside the general dread,
> not having a quiet home-office, nor good exercise, is definitely not
> helping.

Another factor to be careful of is that by committing a new feature in a release cycle, you actually need to think about the extra amount of resources you may need to address comments and issues about it in time during the beta/stability period, and that more care is likely needed if you commit something at the end of the cycle. On top of that, it's currently a bit hard to plan one or two weeks ahead if help is needed to stabilize something you worked on. I am pretty sure that we'll be able to sort things out with a collective effort though.

--
Michael
Re: Improving connection scalability: GetSnapshotData()
On Wed, Apr 8, 2020 at 03:25:34PM -0700, Andres Freund wrote: > Hi, > > On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote: > > If we don't commit this, where does this leave us with the > > old_snapshot_threshold feature? We remove it in back branches and have > > no working version in PG 13? That seems kind of bad. > > I don't think this patch changes the situation for > old_snapshot_threshold in a meaningful way. > > Sure, this patch makes old_snapshot_threshold scale better, and triggers > fewer unnecessary query cancellations. But there still are wrong query > results, the tests still don't test anything meaningful, and the > determination of which query is cancelled is still wrong. Oh, OK, so it still needs to be disabled. I was hoping we could paint this as a fix. Based on Robert's analysis of the risk (no SQL syntax, no storage changes), I think, if you are willing to keep working at this until the final release, it is reasonable to apply it. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-08 18:06:23 -0400, Bruce Momjian wrote: > If we don't commit this, where does this leave us with the > old_snapshot_threshold feature? We remove it in back branches and have > no working version in PG 13? That seems kind of bad. I don't think this patch changes the situation for old_snapshot_threshold in a meaningful way. Sure, this patch makes old_snapshot_threshold scale better, and triggers fewer unnecessary query cancellations. But there still are wrong query results, the tests still don't test anything meaningful, and the determination of which query is cancelled is still wrong. - Andres
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-08 09:44:16 -0400, Robert Haas wrote: > Moreover, shakedown time will be minimal because we're so late in the > release cycle My impression increasingly is that there's very little actual shakedown before beta :(. As e.g. evidenced by the fact that 2PC did basically not work for several months until I did new benchmarks for this patch. I don't know what to do about that, but... Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-08 09:26:42 -0400, Jonathan S. Katz wrote:
> On 4/8/20 8:59 AM, Alexander Korotkov wrote:
> > On Wed, Apr 8, 2020 at 3:43 PM Andres Freund wrote:
> >> Realistically it's still 2-3 hours of proof-reading.
> >>
> >> This makes me sad :(
> >
> > Can we ask RMT to extend feature freeze for this particular patchset?
> > I think it's reasonable assuming extreme importance of this patchset.
>
> One of the features of RMT responsibilities[1] is to be "hands off" as
> much as possible, so perhaps a reverse ask: how would people feel about
> this patch going into PG13, knowing that the commit would come after the
> feature freeze date?

I'm obviously biased, so I don't think there's much point in responding directly to that question. But I thought it could be helpful if I described my thoughts about where the patchset is:

What made me not commit it "earlier" yesterday was not that I had/have any substantial concerns about the technical details of the patch. But that there were a few too many comments that didn't yet sound quite right, that the commit messages didn't yet explain the architecture / benefits well enough, and that I noticed that a few variable names were too easy to be misunderstood by others.

By 5 AM I had addressed most of that, except that some technical details weren't yet mentioned in the commit messages ([1], they are documented in the code). I also produce enough typos / odd grammar when fully awake, so even though I did proofread my changes, I thought that I needed to do that again while awake.

There have been no substantial code changes since yesterday. The variable renaming prompted by Robert (which I agree is an improvement), as well as reducing the diff size by deferring some readability improvements (probably also a good idea), did however produce quite a few conflicts in subsequent patches that I needed to resolve. Another awake read-through to confirm that I resolved them correctly seemed the responsible thing to do before a commit.
> Lastly, with the ongoing world events, perhaps time that could have been
> dedicated to this and other patches likely affected their completion. I
> know most things in my life take way longer than they used to (e.g.
> taking out the trash/recycles has gone from a 15s to 240s routine). The
> same could be said about other patches as well, but this one has a far
> greater impact (a double-edged sword, of course) given it's a feature
> that everyone uses in PostgreSQL ;)

I'm obviously not alone in that, so I agree that it's not an argument pro/con anything.

But this definitely is the case for me. Leaving aside the general dread, not having a quiet home-office, nor good exercise, is definitely not helping.

I'm really bummed that I didn't have the cycles to help get the shared memory stats patch ready as well. It's clearly not yet there (but improved a lot during the CF). But it's been around for so long, and there's so many improvements blocked by the current stats infrastructure...

[1] the "mirroring" of values between dense arrays and PGPROC, the changed locking regimen for ProcArrayAdd/Remove, the widening of lastCompletedXid to be a 64bit xid

[2] https://www.postgresql.org/message-id/20200407121503.zltbpqmdesurflnm%40alap3.anarazel.de

Greetings,

Andres Freund
Re: Improving connection scalability: GetSnapshotData()
On Wed, Apr 8, 2020 at 09:44:16AM -0400, Robert Haas wrote: > I don't know what the right thing to do is. I agree with everyone who > says this is a very important problem, and I have the highest respect > for Andres's technical ability. On the other hand, I have been around > here long enough to know that deciding whether to allow late commits > on the basis of how much we like the feature is a bad plan, because it > takes into account only the upside of a commit, and ignores the > possible downside risk. Typically, the commit is late because the > feature was rushed to completion at the last minute, which can have an > effect on quality. I can say, having read through the patches > yesterday, that they don't suck, but I can't say that they're fully > correct. That's not to say that we shouldn't decide to take them, but > it is a concern to be taken seriously. We have made mistakes before in > what we shipped that had serious implications for many users and for > the project; we should all be wary of making more such mistakes. I am > not trying to say that solving problems and making stuff better is NOT > important, just that every coin has two sides. If we don't commit this, where does this leave us with the old_snapshot_threshold feature? We remove it in back branches and have no working version in PG 13? That seems kind of bad. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-08 09:24:13 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 4:27 PM Andres Freund wrote:
> > The main reason is that we want to be able to cheaply check the current
> > state of the variables (mostly when checking a backend's own state). We
> > can't access the "dense" ones without holding a lock, but we e.g. don't
> > want to make ProcArrayEndTransactionInternal() take a lock just to check
> > if vacuumFlags is set.
> >
> > It turns out to also be good for performance to have the copy for
> > another reason: The "dense" arrays share cachelines with other
> > backends. That's worth it because it allows to make GetSnapshotData(),
> > by far the most frequent operation, touch fewer cache lines. But it also
> > means that it's more likely that a backend's "dense" array entry isn't
> > in a local cpu cache (it'll be pulled out of there when modified in
> > another backend). In many cases we don't need the shared entry at commit
> > etc time though, we just need to check if it is set - and most of the
> > time it won't be. The local entry allows to do that cheaply.
> >
> > Basically it makes sense to access the PGPROC variable when checking a
> > single backend's data, especially when we have to look at the PGPROC for
> > other reasons already. It makes sense to look at the "dense" arrays if
> > we need to look at many / most entries, because we then benefit from the
> > reduced indirection and better cross-process cacheability.
>
> That's a good explanation. I think it should be in the comments or a
> README somewhere.

I had a briefer version in the PROC_HDR comment. I've just expanded it to:

 *
 * The denser separate arrays are beneficial for three main reasons: First, to
 * allow for as tight loops accessing the data as possible. Second, to prevent
 * updates of frequently changing data (e.g. xmin) from invalidating
 * cachelines also containing less frequently changing data (e.g. xid,
 * vacuumFlags). Third, to condense frequently accessed data into as few
 * cachelines as possible.
 *
 * There are two main reasons to have the data mirrored between these dense
 * arrays and PGPROC. First, as explained above, a PGPROC's array entries can
 * only be accessed with either ProcArrayLock or XidGenLock held, whereas the
 * PGPROC entries do not require that (obviously there may still be locking
 * requirements around the individual field, separate from the concerns
 * here). That is particularly important for a backend to efficiently check
 * its own values, which it often can safely do without locking. Second, the
 * PGPROC fields allow to avoid unnecessary accesses and modification to the
 * dense arrays. A backend's own PGPROC is more likely to be in a local cache,
 * whereas the cachelines for the dense array will be modified by other
 * backends (often removing it from the cache for other cores/sockets). At
 * commit/abort time a check of the PGPROC value can avoid accessing/dirtying
 * the corresponding array value.
 *
 * Basically it makes sense to access the PGPROC variable when checking a
 * single backend's data, especially when already looking at the PGPROC for
 * other reasons. It makes sense to look at the "dense" arrays if we
 * need to look at many / most entries, because we then benefit from the
 * reduced indirection and better cross-process cache-ability.
 *
 * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), the data
 * in the dense arrays is initialized from the PGPROC while it already holds

Greetings,

Andres Freund
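The layout described in the PROC_HDR comment above — a per-backend entry whose hot fields are mirrored into dense, contiguous arrays — can be illustrated with a toy struct-of-arrays sketch. This is not PostgreSQL code: the names (assign_xid, oldest_running_xid, backend_has_xid) are hypothetical, it is single-threaded, and all locking is omitted; it only shows why the dense array favors whole-array scans while the mirrored per-backend copy favors cheap local checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_BACKENDS 8

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Per-backend entry (stand-in for PGPROC): holds a mirrored copy of the xid. */
typedef struct Proc
{
	TransactionId xid;			/* mirrored copy, cheap for local checks */
} Proc;

/* Dense contiguous array (stand-in for ProcGlobal->xids): one tight,
 * cache-friendly sweep visits every backend's xid without per-proc pointer
 * chasing. */
static TransactionId dense_xids[MAX_BACKENDS];
static Proc procs[MAX_BACKENDS];

/* Update both places; in the real system this happens under the
 * appropriate lock. */
static void
assign_xid(int idx, TransactionId xid)
{
	dense_xids[idx] = xid;
	procs[idx].xid = xid;
}

/* GetSnapshotData()-style use: scan only the dense array. */
static TransactionId
oldest_running_xid(void)
{
	TransactionId oldest = InvalidTransactionId;

	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		TransactionId xid = dense_xids[i];

		if (xid != InvalidTransactionId &&
			(oldest == InvalidTransactionId || xid < oldest))
			oldest = xid;
	}
	return oldest;
}

/* Commit-path-style use: a backend checks only its own mirrored copy,
 * touching no shared cachelines. */
static bool
backend_has_xid(int idx)
{
	return procs[idx].xid != InvalidTransactionId;
}
```

The design choice mirrored here is the array-of-structs vs struct-of-arrays trade-off: the scan-heavy reader gets the struct-of-arrays layout, while each writer keeps a private array-of-structs view of its own state.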
Re: Improving connection scalability: GetSnapshotData()
On Wed, Apr 8, 2020 at 9:27 AM Jonathan S. Katz wrote:
> One of the features of RMT responsibilities[1] is to be "hands off" as
> much as possible, so perhaps a reverse ask: how would people feel about
> this patch going into PG13, knowing that the commit would come after the
> feature freeze date?

Letting something be committed after feature freeze, or at any other time, is just a risk vs. reward trade-off. Every patch carries some chance of breaking stuff or making things worse. And every patch has a chance of making something better that people care about.

On general principle, I would categorize this as a moderate-risk patch. It doesn't change SQL syntax like, e.g. MERGE, nor does it touch the on-disk format, like, e.g. INSERT .. ON CONFLICT UPDATE. The changes are relatively localized, unlike, e.g. parallel query. Those are all things that reduce risk. On the other hand, it's a brand new patch which has not been thoroughly reviewed by anyone. Moreover, shakedown time will be minimal because we're so late in the release cycle. If it has subtle synchronization problems or if it regresses performance badly in some cases, we might not find out about any of that until after release. While in theory we could revert it any time, since no SQL syntax or on-disk format is affected, in practice it will be difficult to do that if it's making life better for some people and worse for others.

I don't know what the right thing to do is. I agree with everyone who says this is a very important problem, and I have the highest respect for Andres's technical ability. On the other hand, I have been around here long enough to know that deciding whether to allow late commits on the basis of how much we like the feature is a bad plan, because it takes into account only the upside of a commit, and ignores the possible downside risk. Typically, the commit is late because the feature was rushed to completion at the last minute, which can have an effect on quality.
I can say, having read through the patches yesterday, that they don't suck, but I can't say that they're fully correct. That's not to say that we shouldn't decide to take them, but it is a concern to be taken seriously. We have made mistakes before in what we shipped that had serious implications for many users and for the project; we should all be wary of making more such mistakes. I am not trying to say that solving problems and making stuff better is NOT important, just that every coin has two sides. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improving connection scalability: GetSnapshotData()
On 4/8/20 8:59 AM, Alexander Korotkov wrote:
> On Wed, Apr 8, 2020 at 3:43 PM Andres Freund wrote:
>> Realistically it's still 2-3 hours of proof-reading.
>>
>> This makes me sad :(
>
> Can we ask RMT to extend feature freeze for this particular patchset?
> I think it's reasonable assuming extreme importance of this patchset.

One of the features of RMT responsibilities[1] is to be "hands off" as much as possible, so perhaps a reverse ask: how would people feel about this patch going into PG13, knowing that the commit would come after the feature freeze date?

My 2¢, with RMT hat off:

As mentioned earlier[2], we know that connection scalability is a major pain point with PostgreSQL and any effort that can help alleviate that is a huge win, even in incremental gains. Andres et al's experimentation shows that this is more than incremental gains, and will certainly make a huge difference in people's PostgreSQL experience. It is one of those features where you can "plug in and win" -- you get a performance benefit just by upgrading. That is not insignificant.

However, I also want to ensure that we are fair: in the past there have also been other patches that have been "oh-so-close" to commit before feature freeze but have not made it in (an example escapes me at the moment). Therefore, we really need to have consensus among ourselves that the right decision is to allow this to go in after feature freeze.

Did this come in (very) late into the development cycle? Yes, and I think normally that's enough to give cause for pause. But I could also argue that Andres is fixing a "bug" (probably several bugs ;) with PostgreSQL -- and perhaps the fixes can't be backpatched per se, but they do improve the overall stability and usability of PostgreSQL and it'd be a shame if we have to wait on them.

Lastly, with the ongoing world events, perhaps time that could have been dedicated to this and other patches likely affected their completion.
I know most things in my life take way longer than they used to (e.g. taking out the trash/recycles has gone from a 15s to 240s routine). The same could be said about other patches as well, but this one has a far greater impact (a double-edged sword, of course) given it's a feature that everyone uses in PostgreSQL ;)

So with my RMT hat off, I say +1 to allowing this post feature freeze, though within a reasonable window.

My 2¢, with RMT hat on:

I believe in [2] I outlined a path for including the patch even at this stage in the game. If it is indeed committed, I think we immediately put it on the "Recheck at mid-Beta" list. I know it's not as trivial to change as something like "Determine if jit="on" by default" (not picking on Andres, I just remember that example from RMT 11), but it at least provides a notable reminder that we need to ensure we test this thoroughly, and point people to really hammer it during beta.

So with my RMT hat on, I say +0 but with a ;)

Thanks,
Jonathan

[1] https://wiki.postgresql.org/wiki/Release_Management_Team#History
[2] https://www.postgresql.org/message-id/6be8c321-68ea-a865-d8d0-50a3af616463%40postgresql.org
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 4:27 PM Andres Freund wrote:
> The main reason is that we want to be able to cheaply check the current
> state of the variables (mostly when checking a backend's own state). We
> can't access the "dense" ones without holding a lock, but we e.g. don't
> want to make ProcArrayEndTransactionInternal() take a lock just to check
> if vacuumFlags is set.
>
> It turns out to also be good for performance to have the copy for
> another reason: The "dense" arrays share cachelines with other
> backends. That's worth it because it allows to make GetSnapshotData(),
> by far the most frequent operation, touch fewer cache lines. But it also
> means that it's more likely that a backend's "dense" array entry isn't
> in a local cpu cache (it'll be pulled out of there when modified in
> another backend). In many cases we don't need the shared entry at commit
> etc time though, we just need to check if it is set - and most of the
> time it won't be. The local entry allows to do that cheaply.
>
> Basically it makes sense to access the PGPROC variable when checking a
> single backend's data, especially when we have to look at the PGPROC for
> other reasons already. It makes sense to look at the "dense" arrays if
> we need to look at many / most entries, because we then benefit from the
> reduced indirection and better cross-process cacheability.

That's a good explanation. I think it should be in the comments or a README somewhere.

> How about:
> /*
>  * If the current xactCompletionCount is still the same as it was at the
>  * time the snapshot was built, we can be sure that rebuilding the
>  * contents of the snapshot the hard way would result in the same snapshot
>  * contents:
>  *
>  * As explained in transam/README, the set of xids considered running by
>  * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot
>  * contents only depend on transactions with xids, and xactCompletionCount
>  * is incremented whenever a transaction with an xid finishes (while
>  * holding ProcArrayLock exclusively). Thus the xactCompletionCount check
>  * ensures we would detect if the snapshot would have changed.
>  *
>  * As the snapshot contents are the same as before, it is safe
>  * to re-enter the snapshot's xmin into the PGPROC array. None of the rows
>  * visible under the snapshot could already have been removed (that'd
>  * require the set of running transactions to change) and it fulfills the
>  * requirement that concurrent GetSnapshotData() calls yield the same
>  * xmin.
>  */

That's nice and clear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
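The xactCompletionCount idea in the comment quoted above reduces to a small sketch: a shared counter is bumped whenever an xid-bearing transaction completes, a snapshot records the counter value it was built under, and a later equality check proves the snapshot's contents could not have changed. This is a hypothetical single-threaded simplification (no ProcArrayLock, no real snapshot fields), not the actual GetSnapshotData() code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared counter (stand-in for xactCompletionCount): bumped, under
 * ProcArrayLock in the real system, each time an xid-bearing transaction
 * commits or aborts. */
static uint64_t xact_completion_count = 0;

typedef struct Snapshot
{
	uint64_t	built_at;		/* counter value when the snapshot was built */
	/* ... xmin, xmax, xip[] would live here ... */
} Snapshot;

static void
transaction_with_xid_completed(void)
{
	xact_completion_count++;
}

static Snapshot
build_snapshot(void)
{
	Snapshot	snap = {.built_at = xact_completion_count};

	/* ... the expensive part: fill xmin/xmax/xip from the proc array ... */
	return snap;
}

/* Fast path: if no xid-bearing transaction has completed since the snapshot
 * was built, rebuilding it the hard way would yield identical contents. */
static bool
snapshot_still_valid(const Snapshot *snap)
{
	return snap->built_at == xact_completion_count;
}
```

Note that read-only transactions never bump the counter, which is exactly why a read-mostly workload can keep reusing one snapshot.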
Re: Improving connection scalability: GetSnapshotData()
On Wed, Apr 8, 2020 at 3:43 PM Andres Freund wrote:
> Realistically it's still 2-3 hours of proof-reading.
>
> This makes me sad :(

Can we ask the RMT to extend feature freeze for this particular patchset? I think it's reasonable given the extreme importance of this patchset.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-07 16:13:07 -0400, Robert Haas wrote:
> On Tue, Apr 7, 2020 at 3:24 PM Andres Freund wrote:
> > > +	ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
> > >
> > > Apparently this array is not dense in the sense that it excludes
> > > unused slots, but comments elsewhere don't seem to entirely agree.
> >
> > What do you mean with "unused slots"? Backends that committed?
>
> Backends that have no XID. You mean, I guess, that it is "dense" in
> the sense that only live backends are in there, not "dense" in the
> sense that only active write transactions are in there.

Correct.

I tried the "only active write transaction" approach, btw, and had a hard time making it scale well (due to the much more frequent moving of entries at commit/abort time). If we were to go to a 'only active transactions' array at some point we'd imo still need pretty much all the other changes made here - so I'm not worried about it for now.

> > > +	uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
> > >
> > > 	/* ProcGlobal */
> > > 	size = add_size(size, sizeof(PROC_HDR));
> > > -	/* MyProcs, including autovacuum workers and launcher */
> > > -	size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
> > > -	/* AuxiliaryProcs */
> > > -	size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
> > > -	/* Prepared xacts */
> > > -	size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
> > > -	/* ProcStructLock */
> > > +	size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));
> > >
> > > This seems like a bad idea. If we establish a precedent that it's OK
> > > to have sizing routines that don't use add_size() and mul_size(),
> > > people are going to cargo cult that into places where there is more
> > > risk of overflow than there is here.
> >
> > Hm. I'm not sure I see the problem. Are you concerned that TotalProcs
> > would overflow due to too big MaxBackends or max_prepared_xacts? The
> > multiplication itself is still protected by add_size(). It didn't seem
> > correct to use add_size for the TotalProcs addition, since that's not
> > really a size. And since the limit for procs is much lower than
> > UINT32_MAX...
>
> I'm concerned that there are 0 uses of add_size in any shared-memory
> sizing function, and I think it's best to keep it that way.

I can't make sense of that sentence? We already have code like this, and have for a long time:

/* Size of the ProcArray structure itself */
#define PROCARRAY_MAXPROCS	(MaxBackends + max_prepared_xacts)

adding NUM_AUXILIARY_PROCS doesn't really change that, does it?

> If you initialize TotalProcs = add_size(MaxBackends,
> add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy.

Will do.

Greetings,

Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-07 15:26:36 -0400, Robert Haas wrote: > 0008 - > > Here again, I greatly dislike putting Copy in the name. It makes > little sense to pretend that either is the original and the other is > the copy. You just have the same data in two places. If one of them is > more authoritative, the place to explain that is in the comments, not > by elongating the structure member name and supposing anyone will be > able to make something of that. Ok. > 0009, 0010 - > > I think you've got this whole series of things divided up too finely. > Like, 0005 feels like the meat of it, and that has a bunch of things > in it that could plausibly be separated out as separate commits. 0007 > also seems to do more than one kind of thing (see my comment regarding > moving some of that into 0006). But whacking everything around like a > crazy man in 0005 and a little more in 0007 and then doing the > following cleanup in these little tiny steps seems pretty lame. > Separating 0009 from 0010 is maybe the clearest example of that, but > IMHO it's pretty unclear why both of these shouldn't be merged with > 0008. I found it a *lot* easier to review / evolve them this way. I e.g. had an earlier version in which the subxid part of the change worked substantially differently (it tried to elide the overflowed bool, by defining -1 as the indicator for overflows), and it'd have been way harder to change that if I didn't have a patch with *just* the subxid changes. I'd not push them separated by time, but I do think it'd make sense to push them as separate commits. I think it also makes it easier to review them if a bug turns up in a specific area. > My comments on the Copy naming apply here as well. I am also starting > to wonder why exactly we need two copies of all this stuff. Perhaps > I've just failed to absorb the idea for having read the patch too > briefly, but I think that we need to make sure that it's super-clear > why we're doing that.
If we just needed it for one field because > $REASONS, that would be one thing, but if we need it for all of them > then there must be some underlying principle here that needs a good > explanation in an easy-to-find and centrally located place. The main reason is that we want to be able to cheaply check the current state of the variables (mostly when checking a backend's own state). We can't access the "dense" ones without holding a lock, but we e.g. don't want to make ProcArrayEndTransactionInternal() take a lock just to check if vacuumFlags is set. It turns out to also be good for performance to have the copy for another reason: The "dense" arrays share cachelines with other backends. That's worth it because it allows GetSnapshotData(), by far the most frequent operation, to touch fewer cache lines. But it also means that it's more likely that a backend's "dense" array entry isn't in a local CPU cache (it'll be pulled out of there when modified in another backend). In many cases we don't need the shared entry at commit etc. time though, we just need to check if it is set - and most of the time it won't be. The local entry allows doing that cheaply. Basically it makes sense to access the PGPROC variable when checking a single backend's data, especially when we have to look at the PGPROC for other reasons already. It makes sense to look at the "dense" arrays if we need to look at many / most entries, because we then benefit from the reduced indirection and better cross-process cacheability.
no clue what's going on here. Granted the code had no comments > in the old place either, so I guess it's not worse, but even the name > of the new function is pretty incomprehensible. It fills the old_snapshot_threshold fields of a Snapshot. > + * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go > > I know what it means to re-enter a building, but I don't know what it > means to re-enter the snapshot's xmin. Re-entering it into the procarray, thereby preventing rows that the snapshot could see from being removed. > This whole comment seems a bit murky. How about: /* * If the current xactCompletionCount is still the same as it was at the * time the snapshot was built, we can be sure that rebuilding the * contents of the snapshot the hard way would result in the same snapshot * contents: * * As explained in transam/README, the set of xids considered running by * GetSnapshotData() cannot change while ProcArrayLock is held. Snapshot *
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 3:24 PM Andres Freund wrote: > > 0007 - > > > > + TransactionId xidCopy; /* this backend's xid, a copy of this proc's > > +ProcGlobal->xids[] entry. */ > > > > Can we please NOT put Copy into the name like that? Pretty please? > > Do you have a suggested naming scheme? Something indicating that it's > not the only place that needs to be updated? I don't think trying to indicate that in the structure member names is a useful idea. I think you should give them the same names, maybe with an "s" to pluralize the copy hanging off of ProcGlobal, and put a comment that says something like: We keep two copies of each of the following three fields. One copy is here in the PGPROC, and the other is in a more densely-packed array hanging off of PGXACT. Both copies of the value must always be updated at the same time and under the same locks, so that it is always the case that MyProc->xid == ProcGlobal->xids[MyProc->pgprocno] and similarly for vacuumFlags and WHATEVER. Note, however, that the arrays attached to ProcGlobal only contain entries for PGPROC structures that are currently part of the ProcArray (i.e. there is currently a backend for that PGPROC). We use those arrays when STUFF and the copies in the individual PGPROC when THINGS. > I think it's more on-point here, because we need to hold either of the > locks* even, for changes to a backend's own status that one reasonably > could expect would be safe to at least inspect. It's just too brief and obscure to be useful. > > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId; > > > > Apparently this array is not dense in the sense that it excludes > > unused slots, but comments elsewhere don't seem to entirely agree. > > What do you mean with "unused slots"? Backends that committed? Backends that have no XID. You mean, I guess, that it is "dense" in the sense that only live backends are in there, not "dense" in the sense that only active write transactions are in there. 
It would be nice to nail that down better; the wording I suggested above might help. > > + uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + > > max_prepared_xacts; > > > > /* ProcGlobal */ > > size = add_size(size, sizeof(PROC_HDR)); > > - /* MyProcs, including autovacuum workers and launcher */ > > - size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC))); > > - /* AuxiliaryProcs */ > > - size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC))); > > - /* Prepared xacts */ > > - size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC))); > > - /* ProcStructLock */ > > + size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC))); > > > > This seems like a bad idea. If we establish a precedent that it's OK > > to have sizing routines that don't use add_size() and mul_size(), > > people are going to cargo cult that into places where there is more > > risk of overflow than there is here. > > Hm. I'm not sure I see the problem. Are you concerned that TotalProcs > would overflow due to too big MaxBackends or max_prepared_xacts? The > multiplication itself is still protected by add_size(). It didn't seem > correct to use add_size for the TotalProcs addition, since that's not > really a size. And since the limit for procs is much lower than > UINT32_MAX... I'm concerned that there are 0 uses of add_size in any shared-memory sizing function, and I think it's best to keep it that way. If you initialize TotalProcs = add_size(MaxBackends, add_size(NUM_AUXILIARY_PROCS, max_prepared_xacts)) then I'm happy. I think it's a desperately bad idea to imagine that we can dispense with overflow checks here and be safe. It's just too easy for that to become false due to future code changes, or get copied to other places where it's unsafe now. > > You've got a bunch of different places that talk about the new PGXACT > > array and they are somewhat redundant yet without saying exactly the > > same thing every time either. I think that needs cleanup. 
> > Could you point out a few of those comments, I'm not entirely sure which > you're talking about? + /* + * Also allocate a separate arrays for data that is frequently (e.g. by + * GetSnapshotData()) accessed from outside a backend. There is one entry + * in each for every *live* PGPROC entry, and they are densely packed so + * that the first procArray->numProc entries are all valid. The entries + * for a PGPROC in those arrays are at PGPROC->pgxactoff. + * + * Note that they may not be accessed without ProcArrayLock held! Upon + * ProcArrayRemove() later entries will be moved. + * + * These are separate from the main PGPROC array so that the most heavily + * accessed data is stored contiguously in memory in as few cache lines as + * possible. This provides significant performance benefits, especially on + * a multiprocessor system. + */ + * Arrays with per-backend information that is hotly accessed, indexed by + * PGPROC->pgxactoff. These are in separate arrays for three reasons: + * First, to
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 3:31 PM Andres Freund wrote: > Well, it *is* only a vague test :). It shouldn't ever have a false > positive, but there's plenty chance for false negatives (if wrapped > around far enough). Sure, but I think you get my point. Asserting that something "might be" true isn't much of an assertion. Saying that it's in the correct range is not to say there can't be a problem - but we're saying that it IS in the expected range, not that it may or may not be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-07 15:03:46 -0400, Robert Haas wrote: > On Tue, Apr 7, 2020 at 1:51 PM Andres Freund wrote: > > > ComputedHorizons seems like a fairly generic name. I think there's > > > some relationship between InvisibleToEveryoneState and > > > ComputedHorizons that should be brought out more clearly by the naming > > > and the comments. > > > > I don't like the naming of ComputedHorizons, ComputeTransactionHorizons > > much... But I find it hard to come up with something that's meaningfully > > better. > > It would help to stick XID in there, like ComputedXIDHorizons. What I > find really baffling is that you seem to have two structures in the > same file that have essentially the same purpose, but the second one > (ComputedHorizons) has a lot more stuff in it. I can't understand why. ComputedHorizons are the various "accurate" horizons computed by ComputeTransactionHorizons(). That's used to determine a horizon for vacuuming (via GetOldestVisibleTransactionId()) and other similar use cases. The various InvisibleToEveryoneState variables contain the boundary-based horizons, and are updated / initially filled by GetSnapshotData(). When a tested value falls between the boundaries, we update the approximate boundaries using ComputeTransactionHorizons(). That briefly makes the boundaries in the InvisibleToEveryoneState accurate - but future GetSnapshotData() calls will increase the definitely_needed_bound (if transactions committed since). The ComputedHorizons fields could instead just be pointer-based arguments to ComputeTransactionHorizons(), but that seems clearly worse. I'll change ComputedHorizons's comment to say that it's the result of ComputeTransactionHorizons(), not the "state". > > What's the inconsistency? The dropped replication_ vs dropped FullXid > > postfix? > > Yeah, just having the member names be randomly different between the > structs. Really harms greppability.
The long names make it hard to keep line lengths under control, in particular when also involving the long macro names for TransactionId / FullTransactionId comparators... > > > Generally, heap_prune_satisfies_vacuum looks pretty good. The > > > limited_oldest_committed naming is confusing, but the comments make it > > > a lot clearer. > > > > I didn't like _committed much either. But couldn't come up with > > something short. _relied_upon? > > oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed? Will go for old_snapshot_limit_used, and rename the other variables to old_snapshot_limit_ts, old_snapshot_limit_xmin. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-07 14:51:52 -0400, Robert Haas wrote: > On Tue, Apr 7, 2020 at 2:28 PM Andres Freund wrote: > > Does that make some sense? Do you have a better suggestion for a name? > > I think it makes sense. I have two basic problems with the name. The > first is that "on disk" doesn't seem to be a very clear way of > describing what you're actually checking here, and it definitely > doesn't refer to an existing concept which sophisticated hackers can > be expected to understand. The second is that "may" is ambiguous in > English: it can either mean that something is permissible ("Johnny, > you may go to the bathroom") or that we do not have certain knowledge > of it ("Johnny may be in the bathroom"). When it is followed by "be", > it usually has the latter sense, although there are exceptions (e.g. > "She may be discharged from the hospital today if she wishes, but we > recommend that she stay for another day"). Consequently, I found that > use of "may be" in this context wicked confusing. Well, it *is* only a vague test :). It shouldn't ever have a false positive, but there's plenty chance for false negatives (if wrapped around far enough). > So I suggest a name with "Is" or no verb, rather than one with > "MayBe." And I suggest something else instead of "OnDisk," e.g. > AssertTransactionIdIsInUsableRange() or > TransactionIdIsInAllowableRange() or > AssertTransactionIdWraparoundProtected(). I kind of like that last > one, but YMMV. Makes sense - but they all seem to express a bit more certainty than I think the test actually provides. I explicitly did not want (and added a comment to that effect) to have something like TransactionIdIsInAllowableRange(), because there never can be a safe use of its return value, as far as I can tell. The "OnDisk" was intended to clarify that the range it verifies is whether it'd be ok for the xid to have been found in a relation. Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
0008 - Here again, I greatly dislike putting Copy in the name. It makes little sense to pretend that either is the original and the other is the copy. You just have the same data in two places. If one of them is more authoritative, the place to explain that is in the comments, not by elongating the structure member name and supposing anyone will be able to make something of that. + * + * XXX: That's why this is using vacuumFlagsCopy. I am not sure there's any problem with the code that needs fixing here, so I might think about getting rid of this XXX. But this gets back to my complaint about the locking regime being unclear. What I think you need to do here is rephrase the previous paragraph so that it explains the reason for using this copy a bit better. Like "We read the copy of vacuumFlags from PGPROC rather than visiting the copy attached to ProcGlobal because we can do that without taking a lock. See fuller explanation in ." Or whatever. 0009, 0010 - I think you've got this whole series of things divided up too finely. Like, 0005 feels like the meat of it, and that has a bunch of things in it that could plausibly be separated out as separate commits. 0007 also seems to do more than one kind of thing (see my comment regarding moving some of that into 0006). But whacking everything around like a crazy man in 0005 and a little more in 0007 and then doing the following cleanup in these little tiny steps seems pretty lame. Separating 0009 from 0010 is maybe the clearest example of that, but IMHO it's pretty unclear why both of these shouldn't be merged with 0008. To be clear, I exaggerate for effect. 0005 is not whacking everything around like a crazy man. But it is a non-minimal patch, whereas I consider 0009 and 0010 to be sub-minimal. My comments on the Copy naming apply here as well. I am also starting to wonder why exactly we need two copies of all this stuff.
Perhaps I've just failed to absorb the idea for having read the patch too briefly, but I think that we need to make sure that it's super-clear why we're doing that. If we just needed it for one field because $REASONS, that would be one thing, but if we need it for all of them then there must be some underlying principle here that needs a good explanation in an easy-to-find and centrally located place. 0011 - + * Number of top-level transactions that completed in some form since the + * start of the server. This currently is solely used to check whether + * GetSnapshotData() needs to recompute the contents of the snapshot, or + * not. There are likely other users of this. Always above 1. Does it only count XID-bearing transactions? If so, best mention that. + * transactions completed since the last GetSnapshotData().. Too many periods. + /* Same with CSN */ + ShmemVariableCache->xactCompletionCount++; If I didn't know that CSN stood for commit sequence number from reading years of mailing list traffic, I'd be lost here. So I think this comment shouldn't use that term. +GetSnapshotDataFillTooOld(Snapshot snapshot) Uh... no clue what's going on here. Granted the code had no comments in the old place either, so I guess it's not worse, but even the name of the new function is pretty incomprehensible. + * Helper function for GetSnapshotData() that check if the bulk of the checks + * the fields that need to change and returns true. false is returned + * otherwise. Otherwise, it returns false. + * It is safe to re-enter the snapshot's xmin. This can't cause xmin to go I know what it means to re-enter a building, but I don't know what it means to re-enter the snapshot's xmin. This whole comment seems a bit murky. ...Robert
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-07 14:28:09 -0400, Robert Haas wrote: > More review, since it sounds like you like it: > > 0006 - Boring. But I'd probably make this move both xmin and xid back, > with related comment changes; see also next comment. > > 0007 - > > + TransactionId xidCopy; /* this backend's xid, a copy of this proc's > +ProcGlobal->xids[] entry. */ > > Can we please NOT put Copy into the name like that? Pretty please? Do you have a suggested naming scheme? Something indicating that it's not the only place that needs to be updated? > + int pgxactoff; /* offset into various ProcGlobal-> arrays > + * NB: can change any time unless locks held! > + */ > > I'm going to add the helpful comment "NB: can change any time unless > locks held!" to every data structure in the backend that is in shared > memory and not immutable. No need, of course, to mention WHICH > locks... I think it's more on-point here, because we need to hold either of the locks* even, for changes to a backend's own status that one reasonably could expect would be safe to at least inspect. E.g looking at ProcGlobal->xids[MyProc->pgxactoff] doesn't look suspicious, but could very well return another backends xid, if neither ProcArrayLock nor XidGenLock is held (because a ProcArrayRemove() could have changed pgxactoff if a previous entry was removed). *see comment at PROC_HDR: * * Adding/Removing an entry into the procarray requires holding *both* * ProcArrayLock and XidGenLock in exclusive mode (in that order). Both are * needed because the dense arrays (see below) are accessed from * GetNewTransactionId() and GetSnapshotData(), and we don't want to add * further contention by both using one lock. Adding/Removing a procarray * entry is much less frequent. */ typedef struct PROC_HDR { /* Array of PGPROC structures (not including dummies for prepared txns) */ PGPROC *allProcs; /* * Arrays with per-backend information that is hotly accessed, indexed by * PGPROC->pgxactoff. 
These are in separate arrays for three reasons: * First, to allow for as tight loops accessing the data as * possible. Second, to prevent updates of frequently changing data from * invalidating cachelines shared with less frequently changing * data. Third to condense frequently accessed data into as few cachelines * as possible. * * When entering a PGPROC for 2PC transactions with ProcArrayAdd(), those * copies are used to provide the contents of the dense data, and will be * transferred by ProcArrayAdd() while it already holds ProcArrayLock. */ there's also * The various *Copy fields are copies of the data in ProcGlobal arrays that * can be accessed without holding ProcArrayLock / XidGenLock (see PROC_HDR * comments). I had a more explicit warning/explanation about the dangers of accessing the arrays without locks, but apparently went too far when reducing duplicated comments. > On a related note, PROC_HDR really, really, really needs comments > explaining the locking regimen for the new xids field. I'll expand the above, in particular highlighting the danger of pgxactoff changing. > + ProcGlobal->xids[pgxactoff] = InvalidTransactionId; > > Apparently this array is not dense in the sense that it excludes > unused slots, but comments elsewhere don't seem to entirely agree. What do you mean with "unused slots"? Backends that committed? Dense is intended to describe that the arrays only contain currently "live" entries. I.e. that the first procArray->numProcs entries in each array have the data for all procs (including prepared xacts) that are "connected". This is extending the concept that already existed for procArray->pgprocnos. Whereas the PGPROC/PGXACT arrays have "unused" entries interspersed. This is what previously led to the slow loop in GetSnapshotData(), where we had to iterate over PGXACTs over an indirection in procArray->pgprocnos. I.e.
to only look at in-use PGXACTs we had to go through allProcs[arrayP->pgprocnos[i]], which is, uh, suboptimal for a performance critical routine holding a central lock. I'll try to expand the comments around dense, but let me know if you have a better descriptor. > Maybe the comments discussing how it is "dense" need to be a little > more precise about this. > > + for (int i = 0; i < nxids; i++) > > I miss my C89. Yeah, it's just me. Oh, dear god. I hate declaring variables like 'i' on function scope. The bug that haunted me the longest in the development of this patch was in XidCacheRemoveRunningXids, where there are both i and j, and a macro XidCacheRemove(i), but the macro gets passed j as i. > - if (!suboverflowed) > + if (suboverflowed) > + continue; > + > > Do we really need to do this kind of diddling in this patch? I mean > yes to the idea, but no to things that are going to make it harder to > understand what happened if this blows up. I can try to reduce
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 1:51 PM Andres Freund wrote: > > ComputedHorizons seems like a fairly generic name. I think there's > > some relationship between InvisibleToEveryoneState and > > ComputedHorizons that should be brought out more clearly by the naming > > and the comments. > > I don't like the naming of ComputedHorizons, ComputeTransactionHorizons > much... But I find it hard to come up with something that's meaningfully > better. It would help to stick XID in there, like ComputedXIDHorizons. What I find really baffling is that you seem to have two structures in the same file that have essentially the same purpose, but the second one (ComputedHorizons) has a lot more stuff in it. I can't understand why. > What's the inconsistency? The dropped replication_ vs dropped FullXid > postfix? Yeah, just having the member names be randomly different between the structs. Really harms greppability. > > Generally, heap_prune_satisfies_vacuum looks pretty good. The > > limited_oldest_committed naming is confusing, but the comments make it > > a lot clearer. > > I didn't like _committed much either. But couldn't come up with > something short. _relied_upon? oldSnapshotLimitUsed or old_snapshot_limit_used, like currentCommandIdUsed? > It's just adjusting for the changed name of latestCompletedXid to > latestCompletedFullXid, as part of widening it to 64bits. I'm not > really a fan of adding that to the variable name, but surrounding code > already did it (cf VariableCache->nextFullXid), so I thought I'd follow > suit. Oops, that was me misreading the diff. Sorry for the noise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 2:28 PM Andres Freund wrote: > Does that make some sense? Do you have a better suggestion for a name? I think it makes sense. I have two basic problems with the name. The first is that "on disk" doesn't seem to be a very clear way of describing what you're actually checking here, and it definitely doesn't refer to an existing concept which sophisticated hackers can be expected to understand. The second is that "may" is ambiguous in English: it can either mean that something is permissible ("Johnny, you may go to the bathroom") or that we do not have certain knowledge of it ("Johnny may be in the bathroom"). When it is followed by "be", it usually has the latter sense, although there are exceptions (e.g. "She may be discharged from the hospital today if she wishes, but we recommend that she stay for another day"). Consequently, I found that use of "may be" in this context wicked confusing. What came to mind was: bool RobertMayBeAGiraffe(void) { return true; // i mean, i haven't seen him since last week, so who knows? } So I suggest a name with "Is" or no verb, rather than one with "MayBe." And I suggest something else instead of "OnDisk," e.g. AssertTransactionIdIsInUsableRange() or TransactionIdIsInAllowableRange() or AssertTransactionIdWraparoundProtected(). I kind of like that last one, but YMMV. I wish to clarify that in sending these review emails I am taking no position on whether or not it is prudent to commit any or all of them. I do not think we can rule out the possibility that they will Break Things, but neither do I wish to be seen as That Guy Who Stands In The Way of Important Improvements. Time does not permit me a detailed review anyway. So, these comments are provided in the hope that they may be useful but without endorsement or acrimony. If other people want to endorse or, uh, acrimoniate, based on my comments, that is up to them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Improving connection scalability: GetSnapshotData()
On Tue, Apr 7, 2020 at 11:28 AM Andres Freund wrote: > There is a lot of code that is pretty unsafe around wraparounds... They > are getting easier and easier to hit on a regular schedule in production > (plenty of databases that hit wraparounds multiple times a week). And we > as PG developers often don't quite take that into > account. It would be nice if there was high-level documentation on wraparound hazards. Maybe even a dedicated README file. -- Peter Geoghegan
Re: Improving connection scalability: GetSnapshotData()
Hi, On 2020-04-07 10:51:12 -0700, Andres Freund wrote: > > +void AssertTransactionIdMayBeOnDisk(TransactionId xid) > > > > Formatting. > > > > + * Assert that xid is one that we could actually see on disk. > > > > I don't know what this means. The whole purpose of this routine is > > very unclear to me. > > It's intended to be a double check against forgetting things...? Err: It is intended to make it easier to detect cases where the passed TransactionId is not safe against wraparound. If there is protection against wraparound, then the xid a) may never be older than ShmemVariableCache->oldestXid (since otherwise the rel/datfrozenxid could not have advanced past the xid, and because oldestXid is what prevents ->nextFullXid from advancing far enough to cause a wraparound) b) cannot be >= ShmemVariableCache->nextFullXid. If it is, it cannot recently have come from GetNewTransactionId(), and thus there is no anti-wraparound protection either. As full wraparounds are painful to exercise in testing, AssertTransactionIdMayBeOnDisk() is intended to make it easier to detect potential hazards. The reason for the *OnDisk naming is that [oldestXid, nextFullXid) is the appropriate check for values actually stored in tables. There could, and probably should, be a narrower assertion ensuring that an xid is protected against being pruned away (i.e. a PGPROC's xmin covering it). The reason for being concerned enough in the new code to add the new assertion helper (as well as a major motivating reason for making the horizons 64 bit xids) is that it's much harder to ensure that "global xmin" style horizons don't wrap around. By definition they include other backends' ->xmin, and those can be released without a lock at any time. As a lot of wraparound issues are triggered by very long-running transactions, it is not even unlikely to hit such problems: At some point somebody is going to kill that old backend and ->oldestXid will advance very quickly.
There is a lot of code that is pretty unsafe around wraparounds... They are getting easier and easier to hit on a regular schedule in production (plenty of databases that hit wraparounds multiple times a week). And we as PG developers often don't quite take that into account. Does that make some sense? Do you have a better suggestion for a name? Greetings, Andres Freund
Re: Improving connection scalability: GetSnapshotData()
More review, since it sounds like you like it:

0006 - Boring. But I'd probably make this move both xmin and xid back, with related comment changes; see also next comment.

0007 -

+ TransactionId xidCopy; /* this backend's xid, a copy of this proc's
+                         * ProcGlobal->xids[] entry. */

Can we please NOT put Copy into the name like that? Pretty please?

+ int pgxactoff; /* offset into various ProcGlobal-> arrays
+                 * NB: can change any time unless locks held!
+                 */

I'm going to add the helpful comment "NB: can change any time unless locks held!" to every data structure in the backend that is in shared memory and not immutable. No need, of course, to mention WHICH locks...

On a related note, PROC_HDR really, really, really needs comments explaining the locking regimen for the new xids field.

+ ProcGlobal->xids[pgxactoff] = InvalidTransactionId;

Apparently this array is not dense in the sense that it excludes unused slots, but comments elsewhere don't seem to entirely agree. Maybe the comments discussing how it is "dense" need to be a little more precise about this.

+ for (int i = 0; i < nxids; i++)

I miss my C89. Yeah, it's just me.

- if (!suboverflowed)
+ if (suboverflowed)
+     continue;
+

Do we really need to do this kind of diddling in this patch? I mean yes to the idea, but no to things that are going to make it harder to understand what happened if this blows up.

+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;

  /* ProcGlobal */
  size = add_size(size, sizeof(PROC_HDR));
- /* MyProcs, including autovacuum workers and launcher */
- size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC)));
- /* AuxiliaryProcs */
- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC)));
- /* Prepared xacts */
- size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGPROC)));
- /* ProcStructLock */
+ size = add_size(size, mul_size(TotalProcs, sizeof(PGPROC)));

This seems like a bad idea. If we establish a precedent that it's OK to have sizing routines that don't use add_size() and mul_size(), people are going to cargo cult that into places where there is more risk of overflow than there is here.

You've got a bunch of different places that talk about the new PGXACT array and they are somewhat redundant, yet without saying exactly the same thing every time either. I think that needs cleanup.

One thing I didn't see is any clear discussion of what happens if the two copies of the XID information don't agree with each other. That should be added someplace, either in an appropriate code comment or in a README or something. I *think* both are protected by the same locks, but there's also some unlocked access to those structure members, so it's not entirely a slam dunk.

...Robert
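For context, the overflow-safe helpers referred to above look roughly like the following. This is a hedged sketch mirroring the behavior of add_size()/mul_size() in PostgreSQL's shmem.c, except that it signals overflow through an out parameter instead of ereport(ERROR) so it can run standalone.

```c
#include <stddef.h>

/* Overflow-checked Size arithmetic in the style of add_size()/mul_size().
 * The real functions error out on overflow; this sketch flags it via
 * *overflow instead. */
static size_t
add_size_checked(size_t s1, size_t s2, int *overflow)
{
    size_t result = s1 + s2;

    /* unsigned addition wrapped iff the result is smaller than an input */
    if (result < s1 || result < s2)
        *overflow = 1;
    return result;
}

static size_t
mul_size_checked(size_t s1, size_t s2, int *overflow)
{
    size_t result;

    if (s1 == 0 || s2 == 0)
        return 0;
    result = s1 * s2;
    /* a division that doesn't round-trip means the multiply wrapped */
    if (result / s2 != s1)
        *overflow = 1;
    return result;
}
```

The point of routing every shared-memory size computation through helpers like these is exactly the precedent concern above: any one unchecked multiply can silently wrap on a system with large settings.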
Re: Improving connection scalability: GetSnapshotData()
Hi,

Thanks for the review!

On 2020-04-07 12:41:07 -0400, Robert Haas wrote:
> In 0002, the comments in SnapshotSet() are virtually incomprehensible.
> There's no commit message so the reasons for the changes are unclear.
> But mostly looks unproblematic.

I was planning to drop that patch pre-commit, at least for now. I think there's a few live bugs here, but they're all older. I did send a few emails about the class of problem, unfortunately it was a fairly one-sided conversation so far ;)

https://www.postgresql.org/message-id/20200407072418.ccvnyjbrktyi3rzc%40alap3.anarazel.de

> 0003 looks like a fairly unrelated bug fix that deserves to be
> discussed on the thread related to the original patch. Probably should
> be an open item.

There was some discussion in a separate thread:
https://www.postgresql.org/message-id/20200406025651.fpzdb5yyb7qyhqko%40alap3.anarazel.de

The only reason for including it in this patch stack is that I can't really exercise the patchset without the fix (it's a bit sad that this issue has gone unnoticed for months before I found it as part of the development of this patch). Think I'll push a minimal version now, and add an open item.

> Regarding 0005:
>
> There's sort of a mix of terminology here: are we pruning tuples or
> removing tuples or testing whether things are invisible? It would be
> better to be more consistent.
>
> + * State for testing whether tuple versions may be removed. To improve
> + * GetSnapshotData() performance we don't compute an accurate value whenever
> + * acquiring a snapshot. Instead we compute boundaries above/below which we
> + * know that row versions are [not] needed anymore. If at test time values
> + * falls in between the two, the boundaries can be recomputed (unless that
> + * just happened).
>
> I don't like the wording here much. Maybe: State for testing whether
> an XID is invisible to all current snapshots. If an XID precedes
> maybe_needed_bound, it's definitely not visible to any current
> snapshot. If it equals or follows definitely_needed_bound, that XID
> isn't necessarily invisible to all snapshots. If it falls in between,
> we're not sure. If, when testing a particular tuple, we see an XID
> somewhere in the middle, we can try recomputing the boundaries to get
> a more accurate answer (unless we've just done that). This is cheaper
> than maintaining an accurate value all the time.

I'll incorporate that, thanks.

> There's also the problem that this sorta contradicts the comment for
> definitely_needed_bound. There it says intermediate values needed to
> be tested against the ProcArray, whereas here it says we need to
> recompute the bounds. That's kinda confusing.

For me those are the same. Computing an accurate bound is visiting the procarray. But I'll rephrase.

> ComputedHorizons seems like a fairly generic name. I think there's
> some relationship between InvisibleToEveryoneState and
> ComputedHorizons that should be brought out more clearly by the naming
> and the comments.

I don't like the naming of ComputedHorizons, ComputeTransactionHorizons much... But I find it hard to come up with something that's meaningfully better. I'll add a comment.

> + /*
> + * The value of ShmemVariableCache->latestCompletedFullXid when
> + * ComputeTransactionHorizons() held ProcArrayLock.
> + */
> + FullTransactionId latest_completed;
> +
> + /*
> + * The same for procArray->replication_slot_xmin and.
> + * procArray->replication_slot_catalog_xmin.
> + */
> + TransactionId slot_xmin;
> + TransactionId slot_catalog_xmin;
>
> Department of randomly inconsistent names. In general I think it's
> quite hard to grasp the relationship between the different fields in
> ComputedHorizons.

What's the inconsistency? The dropped replication_ prefix vs. the dropped FullXid postfix?

> +
> +bool
> +GinPageIsRecyclable(Page page)
>
> Needs a comment. Or more than one.

Well, I started to write one a couple times. But it's really just moving the pre-existing code from the macro into a function and there weren't any comments around *any* of it before. All my comment attempts basically just were restating the code in so many words, or would have required more work than I saw justified in the context of just moving code.

> - /*
> - * If a transaction wrote a commit record in the gap between taking and
> - * logging the snapshot then latestCompletedXid may already be higher than
> - * the value from the snapshot, so check before we use the incoming value.
> - */
> - if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
> -                           running->latestCompletedXid))
> -     ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
> -
> - Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
> -
> - LWLockRelease(ProcArrayLock);
>
> This code got relocated so that the lock is released later, but you
> didn't add any comments explaining why. Somebody will move it back and
> then you'll yell at them for doing it wrong.
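To make the boundary discussion above concrete, here is a deliberately simplified model of the two-bound visibility test: plain 64-bit integers stand in for XIDs (so wraparound is ignored), and the names are illustrative rather than the patch's actual identifiers.

```c
#include <stdint.h>

/* XIDs below maybe_needed are certainly dead to every snapshot; XIDs
 * at or above definitely_needed may still be visible to someone; the
 * range in between requires recomputing accurate horizons (a
 * ProcArray visit in the real code). */
typedef enum
{
    XID_REMOVABLE,              /* dead to everyone, version can be pruned */
    XID_NEEDED,                 /* possibly visible to some snapshot */
    XID_RECOMPUTE               /* unsure: recompute boundaries, retest */
} HorizonResult;

typedef struct
{
    uint64_t maybe_needed;      /* below this: definitely removable */
    uint64_t definitely_needed; /* at/above this: definitely needed */
} HorizonState;

static HorizonResult
horizon_test(const HorizonState *h, uint64_t xid)
{
    if (xid < h->maybe_needed)
        return XID_REMOVABLE;
    if (xid >= h->definitely_needed)
        return XID_NEEDED;
    return XID_RECOMPUTE;
}
```

The design's payoff is that the common cases fall outside the two bounds and are decided with two comparisons; only the in-between case pays for an accurate recomputation.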
Re: Improving connection scalability: GetSnapshotData()
Comments:

In 0002, the comments in SnapshotSet() are virtually incomprehensible. There's no commit message so the reasons for the changes are unclear. But mostly looks unproblematic.

0003 looks like a fairly unrelated bug fix that deserves to be discussed on the thread related to the original patch. Probably should be an open item.

0004 looks fine.

Regarding 0005:

There's sort of a mix of terminology here: are we pruning tuples or removing tuples or testing whether things are invisible? It would be better to be more consistent.

+ * State for testing whether tuple versions may be removed. To improve
+ * GetSnapshotData() performance we don't compute an accurate value whenever
+ * acquiring a snapshot. Instead we compute boundaries above/below which we
+ * know that row versions are [not] needed anymore. If at test time values
+ * falls in between the two, the boundaries can be recomputed (unless that
+ * just happened).

I don't like the wording here much. Maybe: State for testing whether an XID is invisible to all current snapshots. If an XID precedes maybe_needed_bound, it's definitely not visible to any current snapshot. If it equals or follows definitely_needed_bound, that XID isn't necessarily invisible to all snapshots. If it falls in between, we're not sure. If, when testing a particular tuple, we see an XID somewhere in the middle, we can try recomputing the boundaries to get a more accurate answer (unless we've just done that). This is cheaper than maintaining an accurate value all the time.

There's also the problem that this sorta contradicts the comment for definitely_needed_bound. There it says intermediate values needed to be tested against the ProcArray, whereas here it says we need to recompute the bounds. That's kinda confusing.

ComputedHorizons seems like a fairly generic name. I think there's some relationship between InvisibleToEveryoneState and ComputedHorizons that should be brought out more clearly by the naming and the comments.

+ /*
+ * The value of ShmemVariableCache->latestCompletedFullXid when
+ * ComputeTransactionHorizons() held ProcArrayLock.
+ */
+ FullTransactionId latest_completed;
+
+ /*
+ * The same for procArray->replication_slot_xmin and.
+ * procArray->replication_slot_catalog_xmin.
+ */
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;

Department of randomly inconsistent names. In general I think it's quite hard to grasp the relationship between the different fields in ComputedHorizons.

+static inline bool OldSnapshotThresholdActive(void)
+{
+     return old_snapshot_threshold >= 0;
+}

Formatting.

+
+bool
+GinPageIsRecyclable(Page page)

Needs a comment. Or more than one.

- /*
- * If a transaction wrote a commit record in the gap between taking and
- * logging the snapshot then latestCompletedXid may already be higher than
- * the value from the snapshot, so check before we use the incoming value.
- */
- if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-                           running->latestCompletedXid))
-     ShmemVariableCache->latestCompletedXid = running->latestCompletedXid;
-
- Assert(TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid));
-
- LWLockRelease(ProcArrayLock);

This code got relocated so that the lock is released later, but you didn't add any comments explaining why. Somebody will move it back and then you'll yell at them for doing it wrong. :-)

+ /*
+ * Must have called GetOldestVisibleTransactionId() if using SnapshotAny.
+ * Shouldn't have for an MVCC snapshot. (It's especially worth checking
+ * this for parallel builds, since ambuild routines that support parallel
+ * builds must work these details out for themselves.)
+ */
+ Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
+ Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
+        !TransactionIdIsValid(OldestXmin));
+ Assert(snapshot == SnapshotAny || !anyvisible);

This looks like a gratuitous code relocation.

+HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *dead_after)

I don't much like the name dead_after, but I don't have a better suggestion, either.

- * Deleter committed, but perhaps it was recent enough that some open
- * transactions could still see the tuple.
+ * Deleter committed, allow caller to check if it was recent enough that
+ * some open transactions could still see the tuple.

I think you could drop this change.

+ /*
+ * State related to determining whether a dead tuple is still needed.
+ */
+ InvisibleToEveryoneState *vistest;
+ TimestampTz limited_oldest_ts;
+ TransactionId limited_oldest_xmin;
+ /* have we made removal decision based on old_snapshot_threshold */
+ bool limited_oldest_committed;

Would benefit from more comments.

+ * accuring to prstate->vistest, but that can be removed based on

Typo.

Generally, heap_prune_satisfies_vacuum looks pretty good. The limited_oldest_committed naming is confusing, but the comments make it a lot clearer.

+ * If oldest btpo.xact in the deleted
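As an illustration of the dead_after contract discussed above, here is a hedged, heavily simplified sketch. The real HeapTupleSatisfiesVacuumHorizon() inspects tuple headers and handles many more states; the parameter names and the plain `<` horizon comparison below are stand-ins that ignore wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum
{
    TUPLE_LIVE,
    TUPLE_RECENTLY_DEAD,        /* deleter committed; someone may see it */
    TUPLE_DEAD                  /* dead to everyone, safe to prune */
} VacuumStatus;

/* Instead of comparing the deleting XID against a fixed OldestXmin
 * internally, report it back through *dead_after and let the caller
 * test it against its own (possibly recomputed) horizon. */
static VacuumStatus
satisfies_vacuum_horizon(bool deleter_committed, TransactionId xmax,
                         TransactionId *dead_after)
{
    if (!deleter_committed)
        return TUPLE_LIVE;
    *dead_after = xmax;         /* caller decides using its horizon */
    return TUPLE_RECENTLY_DEAD;
}

/* Caller-side wrapper, loosely resembling heap_prune_satisfies_vacuum(). */
static VacuumStatus
satisfies_vacuum(bool deleter_committed, TransactionId xmax,
                 TransactionId horizon)
{
    TransactionId dead_after = 0;
    VacuumStatus res = satisfies_vacuum_horizon(deleter_committed, xmax,
                                                &dead_after);

    if (res == TUPLE_RECENTLY_DEAD && dead_after < horizon)
        res = TUPLE_DEAD;
    return res;
}
```

Splitting the decision this way is what lets the caller retry with freshly computed horizons only when the cheap boundary test is inconclusive.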
Re: Improving connection scalability: GetSnapshotData()
On 4/7/20 8:15 AM, Andres Freund wrote:
> I think this is pretty close to being committable.
>
> But: This patch came in very late for v13, and it took me much longer to
> polish it up than I had hoped (partially distraction due to various bugs
> I found (in particular snapshot_too_old), partially covid19, partially
> "hell if I know"). The patchset touches core parts of the system. While
> both Thomas and David have done some review, they haven't for the latest
> version (mea culpa).
>
> In many other instances I would say that the above suggests slipping to
> v14, given the timing.
>
> The main reason I am considering pushing is that I think this patchset
> addresses one of the most common critiques of postgres, as well as very
> common, hard to fix, real-world production issues. GetSnapshotData() has
> been a major bottleneck for about as long as I have been using postgres,
> and this addresses that to a significant degree.
>
> A second reason I am considering it is that, in my opinion, the changes
> are not all that complicated and not even that large. At least not for a
> change to a problem that we've long tried to improve.

Even as recently as earlier this week there was a blog post making the rounds about the pain points of running PostgreSQL with many simultaneous connections. Anything to help with that would go a long way, and looking at the benchmarks you ran (at least with a quick, non-thorough glance) this could and should be very positively impactful to a *lot* of PostgreSQL users.

I can't comment on the "close to committable" aspect (at least not with an informed, confident opinion), but if it is indeed close to committable, you can put in the work to finish polishing (read: "bug fixing" :-), and we have a plan both for testing and, if need be, reverting, I would be okay with including it, for whatever my vote is worth.

Is the timing / situation ideal?
No, but the way you describe it, it sounds like there is enough that can be done to ensure it's ready for Beta 1. From an RMT standpoint, perhaps this is one of the "Recheck at Mid-Beta" items, as well.

Thanks,

Jonathan
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-07 05:15:03 -0700, Andres Freund wrote:
> Attached is a substantially polished version of my patches. Note that
> the first three patches, as well as the last, are not intended to be
> committed at this time / in this form - they're there to make testing
> easier.

I didn't actually attach that last not-to-be-committed patch... It's just the pgbench patch that I had attached before (and started a thread about). Here it is again.

>From 59a9a03da728d53364f9c3d6fe8b48e21697b93e Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 6 Apr 2020 21:28:55 -0700
Subject: [PATCH v7 12/12] WIP: pgbench

---
 src/bin/pgbench/pgbench.c | 107 +-
 1 file changed, 83 insertions(+), 24 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e99af801675..21d1ab2aac1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -310,6 +310,10 @@ typedef struct RandomState
 /* Various random sequences are initialized from this one. */
 static RandomState base_random_sequence;
 
+#ifdef ENABLE_THREAD_SAFETY
+pthread_barrier_t conn_barrier;
+#endif
+
 /*
  * Connection state machine states.
  */
@@ -5206,6 +5210,10 @@ printResults(StatsData *total, instr_time total_time,
 	tps_exclude = ntx /
 		(time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients));
 
+	//fprintf(stderr, "time: include: %f, exclude: %f, conn total: %f\n",
+	//		time_include,
+	//		time_include - (INSTR_TIME_GET_DOUBLE(conn_total_time) / nclients),
+	//		INSTR_TIME_GET_DOUBLE(conn_total_time));
+
 	/* Report test parameters. */
 	printf("transaction type: %s\n",
 		   num_scripts == 1 ? sql_script[0].desc : "multiple scripts");
@@ -6126,26 +6134,14 @@ main(int argc, char **argv)
 	/* all clients must be assigned to a thread */
 	Assert(nclients_dealt == nclients);
 
-	/* get start up time */
-	INSTR_TIME_SET_CURRENT(start_time);
-
-	/* set alarm if duration is specified. */
-	if (duration > 0)
-		setalarm(duration);
-
 	/* start threads */
 #ifdef ENABLE_THREAD_SAFETY
+	pthread_barrier_init(&conn_barrier, NULL, nthreads);
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
-		INSTR_TIME_SET_CURRENT(thread->start_time);
-
-		/* compute when to stop */
-		if (duration > 0)
-			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
-				(int64) 1000000 * duration;
-
 		/* the first thread (i = 0) is executed by main thread */
 		if (i > 0)
 		{
@@ -6162,13 +6158,38 @@
 			thread->thread = INVALID_THREAD;
 		}
 	}
-#else
-	INSTR_TIME_SET_CURRENT(threads[0].start_time);
-	/* compute when to stop */
+#endif							/* ENABLE_THREAD_SAFETY */
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* wait till all threads started (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: waiting for thread start: %u\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
+#endif							/* ENABLE_THREAD_SAFETY */
+
+	/* get start up time */
+	INSTR_TIME_SET_CURRENT(start_time);
+
+	/* */
+	for (i = 0; i < nthreads; i++)
+	{
+		TState	   *thread = &threads[i];
+
+		thread->start_time = start_time;
+
+		/* compute when to stop */
+		if (duration > 0)
+			end_time = INSTR_TIME_GET_MICROSEC(thread->start_time) +
+				(int64) 1000000 * duration;
+	}
+
+	/* set alarm if duration is specified. */
 	if (duration > 0)
-		end_time = INSTR_TIME_GET_MICROSEC(threads[0].start_time) +
-			(int64) 1000000 * duration;
-	threads[0].thread = INVALID_THREAD;
+		setalarm(duration);
+
+#ifdef ENABLE_THREAD_SAFETY
+	/* updated start time (threads wait in threadRun()) */
+	//fprintf(stderr, "andres: %u: waiting for start time\n", threads[0].tid);
+	pthread_barrier_wait(&conn_barrier);
 #endif							/* ENABLE_THREAD_SAFETY */
 
 	/* wait for threads and accumulate results */
@@ -6236,12 +6257,30 @@ threadRun(void *arg)
 	int			i;
 
 	/* for reporting progress: */
-	int64		thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
-	int64		last_report = thread_start;
-	int64		next_report = last_report + (int64) progress * 1000000;
+	int64		thread_start;
+	int64		last_report;
+	int64		next_report;
 	StatsData	last,
 				aggs;
 
+	/* wait till all threads started (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for thread start\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	/* wait for start time to be initialized (main waits outside) */
+	if (thread->tid != 0)
+	{
+		//fprintf(stderr, "andres: %u: waiting for start time\n", thread->tid);
+		pthread_barrier_wait(&conn_barrier);
+	}
+
+	thread_start = INSTR_TIME_GET_MICROSEC(thread->start_time);
+	last_report = thread_start;
+	next_report = last_report + (int64) progress * 1000000;
+
 	/*
 	 * Initialize throttling rate target for all of the thread's clients. It
 	 * might be a little more accurate to reset thread->start_time here too.
@@ -6288,7 +6327,27 @@ threadRun(void *arg)
 
 	/* time after thread and connections set up */
 	INSTR_TIME_SET_CURRENT(thread->conn_time);
-
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:
> > 3) Plain pgbench read-write (you already did it for sure)
>
> -s 100 -M prepared -T 700
>
> autovacuum=off, fsync on:
> clients   tps master   tps pgxact
> 1         474          479
> 16        4356         4476
> 40        8591         9309
> 198       20045        20261
> 1024      17986        18545
>
> autovacuum=off, fsync off:
> clients   tps master   tps pgxact
> 1         7828         7719
> 16        49069        50482
> 40        68241        73081
> 198       73464        77801
> 1024      25621        28410
>
> I chose autovacuum off because otherwise the results vary much more
> widely, and autovacuum isn't really needed for the workload.

> These numbers seem pretty decent to me. The regressions seem mostly
> within noise. The one possible exception to that is plain pgbench
> read/write with fsync=off and only a single session. I'll run more
> benchmarks around that tomorrow (but now it's 6am :().

The "one possible exception" turned out to be a "real" regression, but one that was dead easy to fix: it was a DEBUG1 elog I had left in. The overhead seems to solely have been the increased code size + overhead of errstart(). After that there's no difference in the single client case anymore (I'd not expect a benefit).

Greetings,

Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-04-06 06:39:59 -0700, Andres Freund wrote:
> These benchmarks are on my workstation. The larger VM I used in the last
> round wasn't currently available.

One way to reproduce the problem at smaller connection counts / smaller machines is to take more snapshots. Doesn't fully reproduce the problem, because resetting ->xmin without xact overhead is part of the problem, but it's helpful.

I use a volatile function that loops over a trivial statement. There's probably an easier / more extreme way to reproduce the problem. But it's good enough.

-- setup
CREATE OR REPLACE FUNCTION snapme(p_ret int, p_loop int) RETURNS int VOLATILE LANGUAGE plpgsql AS
$$
BEGIN
    FOR x in 1..p_loop LOOP
        EXECUTE 'SELECT 1';
    END LOOP;
    RETURN p_ret;
END;
$$;

-- statement executed in parallel
SELECT snapme(17, 1);

before (all above 1.5%):
+ 37.82%  postgres  postgres      [.] GetSnapshotData
+  6.26%  postgres  postgres      [.] AllocSetAlloc
+  3.77%  postgres  postgres      [.] base_yyparse
+  3.04%  postgres  postgres      [.] core_yylex
+  1.94%  postgres  postgres      [.] grouping_planner
+  1.83%  postgres  libc-2.30.so  [.] __strncpy_avx2
+  1.80%  postgres  postgres      [.] palloc
+  1.73%  postgres  libc-2.30.so  [.] __memset_avx2_unaligned_erms

after:
+  5.75%  postgres  postgres      [.] base_yyparse
+  4.37%  postgres  postgres      [.] palloc
+  4.29%  postgres  postgres      [.] AllocSetAlloc
+  3.75%  postgres  postgres      [.] expression_tree_walker.part.0
+  3.14%  postgres  postgres      [.] core_yylex
+  2.51%  postgres  postgres      [.] subquery_planner
+  2.48%  postgres  postgres      [.] CheckExprStillValid
+  2.45%  postgres  postgres      [.] check_stack_depth
+  2.42%  postgres  plpgsql.so    [.] exec_stmt
+  1.92%  postgres  libc-2.30.so  [.] __memset_avx2_unaligned_erms
+  1.91%  postgres  postgres      [.] query_tree_walker
+  1.88%  postgres  libc-2.30.so  [.] __GI_strtoll_l_internal
+  1.86%  postgres  postgres      [.] _SPI_execute_plan
+  1.85%  postgres  postgres      [.] assign_query_collations_walker
+  1.84%  postgres  postgres      [.] remove_useless_results_recurse
+  1.83%  postgres  postgres      [.] grouping_planner
+  1.50%  postgres  postgres      [.] set_plan_refs

If I change the workload to be

BEGIN;
SELECT txid_current();
SELECT snapme(17, 1000);
COMMIT;

the difference reduces (because GetSnapshotData() only needs to look at procs with xids, and xids are assigned for much longer), but still is significant:

before (all above 1.5%):
+ 35.89%  postgres  postgres      [.] GetSnapshotData
+  7.94%  postgres  postgres      [.] AllocSetAlloc
+  4.42%  postgres  postgres      [.] base_yyparse
+  3.62%  postgres  libc-2.30.so  [.] __memset_avx2_unaligned_erms
+  2.87%  postgres  postgres      [.] LWLockAcquire
+  2.76%  postgres  postgres      [.] core_yylex
+  2.30%  postgres  postgres      [.] expression_tree_walker.part.0
+  1.81%  postgres  postgres      [.] MemoryContextAllocZeroAligned
+  1.80%  postgres  postgres      [.] transformStmt
+  1.66%  postgres  postgres      [.] grouping_planner
+  1.64%  postgres  postgres      [.] subquery_planner

after:
+ 24.59%  postgres  postgres      [.] GetSnapshotData
+  4.89%  postgres  postgres      [.] base_yyparse
+  4.59%  postgres  postgres      [.] AllocSetAlloc
+  3.00%  postgres  postgres      [.] LWLockAcquire
+  2.76%  postgres  postgres      [.] palloc
+  2.27%  postgres  postgres      [.] MemoryContextAllocZeroAligned
+  2.26%  postgres  postgres      [.] check_stack_depth
+  1.77%  postgres  postgres      [.] core_yylex

Greetings,

Andres Freund
Re: Improving connection scalability: GetSnapshotData()
Hi,

These benchmarks are on my workstation. The larger VM I used in the last round wasn't currently available.

HW:
2 x Intel(R) Xeon(R) Gold 5215 (each 10 cores / 20 threads)
192GB Ram.
data directory is on a Samsung SSD 970 PRO 1TB

A bunch of terminals, emacs, mutt are open while the benchmark is running. No browser.

Unless mentioned otherwise, relevant configuration options are:
max_connections=1200
shared_buffers=8GB
max_prepared_transactions=1000
synchronous_commit=local
huge_pages=on
fsync=off # to make it more likely to see scalability bottlenecks

Independent of the effects of this patch (i.e. including master) I had a fairly hard time getting reproducible numbers for *low* client cases. I found the numbers to be more reproducible if I pinned server/pgbench onto the same core :(. I chose to do that for the -c1 cases, to benchmark the optimal behaviour, as that seemed to have the biggest potential for regressions.

All numbers are best of three. Tests each start in a freshly created cluster.

On 2020-03-30 17:04:00 +0300, Alexander Korotkov wrote:
> Following pgbench scripts comes first to my mind:
> 1) SELECT txid_current(); (artificial but good for checking corner case)

-M prepared -T 180
(did a few longer runs, but doesn't seem to matter much)

clients   tps master   tps pgxact
1         46118        46027
16        377357       440233
40        373304       410142
198       103912       105579

btw, there's some pretty horrible cacheline bouncing in txid_current() because backends first ReadNextFullTransactionId() (acquires XidGenLock in shared mode, reads ShmemVariableCache->nextFullXid), then separately cause GetNewTransactionId() (acquires XidGenLock exclusively, reads & writes nextFullXid).

With fsync=off (and also for synchronous_commit=off) the numbers are, at lower client counts, severely depressed and variable due to walwriter going completely nuts (using more CPU than the backend doing the queries). Because WAL writes are so fast on my storage, individual XLogBackgroundFlush() calls are very quick. This leads to a *lot* of kill()s from the backend, from within XLogSetAsyncXactLSN(). There's got to be a bug here. But unrelated.

> 2) Single insert statement (as example of very short transaction)

CREATE TABLE testinsert(c1 int not null, c2 int not null, c3 int not null, c4 int not null);
INSERT INTO testinsert VALUES(1, 2, 3, 4);

-M prepared -T 360

fsync on:
clients   tps master   tps pgxact
1         653          658
16        5687         5668
40        14212        14229
198       60483        62420

fsync off:
clients   tps master   tps pgxact
1         59356        59891
16        290626       21
40        348210       355669
198       289182       291529

clients   tps master   tps pgxact
1024      47586        52135

-M simple
fsync off:
clients   tps master   tps pgxact
40        289077       326699
198       286011       299928

> 3) Plain pgbench read-write (you already did it for sure)

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients   tps master   tps pgxact
1         474          479
16        4356         4476
40        8591         9309
198       20045        20261
1024      17986        18545

autovacuum=off, fsync off:
clients   tps master   tps pgxact
1         7828         7719
16        49069        50482
40        68241        73081
198       73464        77801
1024      25621        28410

I chose autovacuum off because otherwise the results vary much more widely, and autovacuum isn't really needed for the workload.

> 4) pgbench read-write script with increased amount of SELECTs. Repeat
> select from pgbench_accounts say 10 times with different aids.

I did intersperse all server-side statements in the script with two selects of other pgbench_account rows each.

-s 100 -M prepared -T 700

autovacuum=off, fsync on:
clients   tps master   tps pgxact
1         365          367
198       20065        21391

-s 1000 -M prepared -T 700

autovacuum=off, fsync on:
clients   tps master   tps pgxact
16        2757         2880
40        4734         4996
198       16950        19998
1024      22423        24935

> 5) 10% pgbench read-write, 90% of pgbench read-only

-s 100 -M prepared -T 100 -bselect-only@9 -btpcb-like@1

autovacuum=off, fsync on:
clients   tps master   tps pgxact
16        37289        38656
40        81284        81260
198       189002       189357
1024      143986       164762

> > That definitely needs to be measured, due to the locking changes around
> > procarrayadd/remove.
> >
> > I don't think regressions besides perhaps 2pc are likely - there's nothing
> > really getting more expensive but procarray add/remove.
>
> I agree that ProcArrayAdd()/Remove() should be first subject of
Re: Improving connection scalability: GetSnapshotData()
On Sun, 1 Mar 2020 at 21:47, Andres Freund wrote:
> On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
> > conns   tps master       tps pgxact-split
> >
> > 1       26842.492845     26524.194821
> > 10      246923.158682    249224.782661
> > 50      695956.539704    709833.746374
> > 100     1054727.043139   1903616.306028
> > 200     964795.282957    1949200.338012
> > 300     906029.377539    1927881.231478
> > 400     845696.690912    1911065.369776
> > 500     812295.222497    1926237.255856
> > 600     888030.104213    1903047.236273
> > 700     866896.532490    1886537.202142
> > 800     863407.341506    1883768.592610
> > 900     871386.608563    1874638.012128
> > 1000    887668.277133    1876402.391502
> > 1500    860051.361395    1815103.564241
> > 2000    890900.098657    1775435.271018
> > 3000    874184.980039    1653953.817997
> > 4000    845023.080703    1582582.316043
> > 5000    817100.195728    1512260.802371
> >
> > I think these are pretty nice results.

FWIW, I took this for a spin on an AMD 3990x:

# setup
pgbench -i postgres

# benchmark
#!/bin/bash
for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000 3000 4000 5000;
do
    echo Testing with $i connections >> bench.log
    pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done

pgbench2 is your patched version of pgbench.

I got some pretty strange results with the unpatched version. Up to about 50 million tps for the "excluding connections establishing" figure, which seems pretty far-fetched.

connections   Unpatched      Patched
1             49062.24413    49834.64983
10            428673.1027    453290.5985
50            1552413.084    1849233.821
100           2039675.027    2261437.1
200           3139648.845    3632008.991
300           3091248.316    3597748.942
400           3056453.5      3567888.293
500           3019571.47     3574009.053
600           2991052.393    3537518.903
700           2952484.763    3553252.603
800           2910976.875    3539404.865
900           2873929.989    3514353.776
1000          2846859.499    3490006.026
1500          2540003.038    3370093.934
2000          2361799.107    3197556.738
3000          2056973.778    2949740.692
4000          1751418.117    2627174.81
5000          1464786.461    2334586.042

> Attached as a graph as well.

Likewise.

David
Re: Improving connection scalability: GetSnapshotData()
Hi,

On 2020-03-31 13:04:38 -0700, Andres Freund wrote:
> I'm still fighting with snapshot_too_old. The feature is just badly
> undertested, underdocumented, and there's lots of other oddities. I've
> now spent about as much time on that feature than on the whole rest of
> the patchset.

To expand on this being under-tested: The whole time mapping infrastructure is not tested, because all of that is bypassed when old_snapshot_threshold = 0. And old_snapshot_threshold = 0 basically only exists for testing.

The largest part of the complexity of this feature are TransactionIdLimitedForOldSnapshots() and MaintainOldSnapshotTimeMapping() - and none of that complexity is tested due to the tests running with old_snapshot_threshold = 0. So we have test-only infrastructure that doesn't allow us to actually test the feature.

And the tests that we do have don't have a single comment explaining what the expected results are. Except for the newer sto_using_hash_index.spec, they just run all permutations. I don't know how those tests actually help, since it's not clear why any of the results are the way they are, which of them are just the results of bugs, or which are not affected by s_t_o at all.

Greetings,

Andres Freund