Re: slowest tap tests - split or accelerate?

2022-01-26 Thread Andres Freund
Hi,

On 2022-01-19 18:18:59 -0800, Andres Freund wrote:
> >  robocopy /e /NFL /NDL tmp_install\initdb_template t\
> >  575ms
> > 
> > So I guess we could use robocopy? That's shipped as part of windows 
> > starting in
> > windows 10... xcopy has been there for longer, so I might just default to 
> > that.
> 
> It's part of of the OS back to at least windows 2016. I've found some random
> links on the webs saying that it's included "This command is available in
> Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can
> be downloaded as part of Server 2003 Windows Resource Kit tools. ".
> 
> Given that our oldest supported msvc version only runs on Windows 7 upwards
> [2], I think we should be good?

One thing I'm not sure about is where to perform the creation of the
"template" for the msvc scripts. The prototype upthread created it
unconditionally in Install.pm, but that's clearly not right.

The buildfarm currently creates the temporary installation using a generic
perl install.pl "$installdir" and then uses NO_TEMP_INSTALL.

I don't really have a better idea than to introduce a dedicated vcregress.pl
command to create the temporary installation? :(

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-20 Thread Andrew Dunstan


On 1/19/22 21:18, Andres Freund wrote:
> Hi,
>
> On 2022-01-19 09:42:31 -0800, Andres Freund wrote:
>> Both ours have this. Unfortunately on windows cp doesn't natively
>> exist. Although git does provide it.  I tried a few things that appear to be
>> natively available (time is best of three executions):
>>
>>  gnu cp from git, cp -a tmp_install\initdb_template t\
>>  670ms
>>
>>  xcopy.exe /E /Q tmp_install\initdb_template t\
>>  638ms
> This errors out if there's any forward slashes in paths, thinking it's a
> flag. Seems out.
>
>
>>  robocopy /e /NFL /NDL tmp_install\initdb_template t\
>>  575ms
>>
>> So I guess we could use robocopy? That's shipped as part of windows starting 
>> in
>> windows 10... xcopy has been there for longer, so I might just default to 
>> that.
> It's part of of the OS back to at least windows 2016. I've found some random
> links on the webs saying that it's included "This command is available in
> Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can
> be downloaded as part of Server 2003 Windows Resource Kit tools. ".
>
> Given that our oldest supported msvc version only runs on Windows 7 upwards
> [2], I think we should be good?
>
>
> Alternatively we could lift copydir() to src/common? But that seems like a bit
> more work than I want to put in.
>
>
> For a second I was thinking that using something like copy --reflink=auto
> could make a lot of sense for machines like florican, removing most of the IO
> from a "templated initdb". But it looks like freebsd doesn't have that, and
> it'd be a pain to figure out whether cp has --reflink.



FYI, the buildfarm code has this. It doesn't need backslashed paths, you
just need to quote the paths, which you should probably do anyway:


sub copydir
{
    my ($from, $to, $logfile) = @_;
    my ($cp, $rd);
    if ($PGBuild::conf{using_msvc})
    {
    $cp = "robocopy /nfl /ndl /np /e /sec ";
    $rd = qq{/LOG+:"$logfile" >nul};
    }
    else
    {
    $cp = "cp -r";
    $rd = qq{> "$logfile"};
    }
    system(qq{$cp "$from" "$to" $rd 2>&1});
    ## no critic (RequireLocalizedPunctuationVars)
    $? = 0 if ($cp =~ /robocopy/ && $? >> 8 == 1);
    return;
}

cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: slowest tap tests - split or accelerate?

2022-01-19 Thread Andres Freund
Hi,

On 2022-01-19 09:42:31 -0800, Andres Freund wrote:
> Both ours have this. Unfortunately on windows cp doesn't natively
> exist. Although git does provide it.  I tried a few things that appear to be
> natively available (time is best of three executions):
> 
>  gnu cp from git, cp -a tmp_install\initdb_template t\
>  670ms
> 
>  xcopy.exe /E /Q tmp_install\initdb_template t\
>  638ms

This errors out if there's any forward slashes in paths, thinking it's a
flag. Seems out.


>  robocopy /e /NFL /NDL tmp_install\initdb_template t\
>  575ms
> 
> So I guess we could use robocopy? That's shipped as part of windows starting 
> in
> windows 10... xcopy has been there for longer, so I might just default to 
> that.

It's part of of the OS back to at least windows 2016. I've found some random
links on the webs saying that it's included "This command is available in
Vista and Windows 7 by default. For Windows XP and Server 2003 this tool can
be downloaded as part of Server 2003 Windows Resource Kit tools. ".

Given that our oldest supported msvc version only runs on Windows 7 upwards
[2], I think we should be good?


Alternatively we could lift copydir() to src/common? But that seems like a bit
more work than I want to put in.


For a second I was thinking that using something like copy --reflink=auto
could make a lot of sense for machines like florican, removing most of the IO
from a "templated initdb". But it looks like freebsd doesn't have that, and
it'd be a pain to figure out whether cp has --reflink.

Greetings,

Andres Freund

[1] https://www.windows-commandline.com/download-robocopy/
[2] 
https://docs.microsoft.com/en-us/visualstudio/releases/2013/vs2013-sysrequirements-vs




Re: slowest tap tests - split or accelerate?

2022-01-19 Thread Andres Freund
Hi,

On 2022-01-19 12:14:21 -0500, Tom Lane wrote:
> No, it was largely the same as what you have here, I think.  I dug
> up my WIP patch and attach it below, just in case there's any ideas
> worth borrowing.

Heh, it does look quite similar.


> +  "cp -a \"%s\" \"%s/data\" > 
> \"%s/log/initdb.log\" 2>&1",
> +  pg_proto_datadir,
> +  temp_instance,
> +  outputdir);
> + if (system(buf))
> + {
> + fprintf(stderr, _("\n%s: cp failed\nExamine 
> %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, 
> buf);
> + exit(2);
> + }

Both ours have this. Unfortunately on windows cp doesn't natively
exist. Although git does provide it.  I tried a few things that appear to be
natively available (time is best of three executions):

 gnu cp from git, cp -a tmp_install\initdb_template t\
 670ms

 xcopy.exe /E /Q tmp_install\initdb_template t\
 638ms

 robocopy /e /NFL /NDL tmp_install\initdb_template t\
 575ms

So I guess we could use robocopy? That's shipped as part of windows starting in
windows 10... xcopy has been there for longer, so I might just default to that.

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-19 Thread Tom Lane
Andres Freund  writes:
> On 2022-01-19 11:54:01 -0500, Tom Lane wrote:
>> Me too ;-).  As I remarked earlier, I'd tried this once before and
>> gave up because it didn't seem to be winning much.  But that was
>> before we had so many initdb's triggered by TAP tests, I think.

> What approach did you use? Do you have a better idea than generating
> tmp_install/initdb_template?

No, it was largely the same as what you have here, I think.  I dug
up my WIP patch and attach it below, just in case there's any ideas
worth borrowing.

>> (Note that all four runs have the "fsync = on" removed from
>> 008_fsm_truncation.pl.)

> I assume you're planning on comitting that?

Yeah, will do that shortly.

regards, tom lane

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index dc8a89af8e..a28a03ac72 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -332,6 +332,7 @@ ifeq ($(MAKELEVEL),0)
 	rm -rf '$(abs_top_builddir)'/tmp_install
 	$(MKDIR_P) '$(abs_top_builddir)'/tmp_install/log
 	$(MAKE) -C '$(top_builddir)' DESTDIR='$(abs_top_builddir)'/tmp_install install >'$(abs_top_builddir)'/tmp_install/log/install.log 2>&1
+	'$(abs_top_builddir)/tmp_install$(bindir)/initdb' -D '$(abs_top_builddir)/tmp_install/proto_pgdata' --no-clean --no-sync -A trust >'$(abs_top_builddir)'/tmp_install/log/initdb.log 2>&1
 endif
 	$(if $(EXTRA_INSTALL),for extra in $(EXTRA_INSTALL); do $(MAKE) -C '$(top_builddir)'/$$extra DESTDIR='$(abs_top_builddir)'/tmp_install install >>'$(abs_top_builddir)'/tmp_install/log/install.log || exit; done)
 endif
@@ -355,7 +356,7 @@ $(if $(filter $(PORTNAME),darwin),DYLD_LIBRARY_PATH,$(if $(filter $(PORTNAME),ai
 endef
 
 define with_temp_install
-PATH="$(abs_top_builddir)/tmp_install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(abs_top_builddir)/tmp_install$(libdir))
+PATH="$(abs_top_builddir)/tmp_install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(abs_top_builddir)/tmp_install$(libdir)) PG_PROTO_DATADIR='$(abs_top_builddir)/tmp_install/proto_pgdata'
 endef
 
 ifeq ($(enable_tap_tests),yes)
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index f4fa600951..22c1a726c7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -404,8 +404,23 @@ sub init
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
 
-	TestLib::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	# If we're using default initdb parameters, and the top-level "make check"
+	# created a prototype data directory for us, just copy that.
+	if (!defined($params{extra}) &&
+		defined($ENV{PG_PROTO_DATADIR}) &&
+		-d $ENV{PG_PROTO_DATADIR})
+	{
+		rmdir($pgdata);
+		RecursiveCopy::copypath($ENV{PG_PROTO_DATADIR}, $pgdata);
+		chmod(0700, $pgdata);
+	}
+	else
+	{
+		TestLib::system_or_bail('initdb', '-D', $pgdata,
+'--no-clean', '--no-sync', '-A', 'trust',
+@{ $params{extra} });
+	}
+
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata);
 
 	open my $conf, '>>', "$pgdata/postgresql.conf";
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index abb742b1ed..04d7fb910b 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2214,6 +2214,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 
 	if (temp_instance)
 	{
+		char	   *pg_proto_datadir;
+		struct stat ppst;
 		FILE	   *pg_conf;
 		const char *env_wait;
 		int			wait_seconds;
@@ -2243,20 +2245,43 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (!directory_exists(buf))
 			make_directory(buf);
 
-		/* initdb */
-		header(_("initializing database system"));
-		snprintf(buf, sizeof(buf),
- "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync%s%s > \"%s/log/initdb.log\" 2>&1",
- bindir ? bindir : "",
- bindir ? "/" : "",
- temp_instance,
- debug ? " --debug" : "",
- nolocale ? " --no-locale" : "",
- outputdir);
-		if (system(buf))
+		/* see if we should run initdb or just copy a prototype datadir */
+		pg_proto_datadir = getenv("PG_PROTO_DATADIR");
+		if (!nolocale &&
+			pg_proto_datadir &&
+			stat(pg_proto_datadir, ) == 0 &&
+			S_ISDIR(ppst.st_mode))
 		{
-			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
-			exit(2);
+			/* copy prototype */
+			header(_("copying prototype data directory"));
+			snprintf(buf, sizeof(buf),
+	 "cp -a \"%s\" \"%s/data\" > \"%s/log/initdb.log\" 2>&1",
+	 pg_proto_datadir,
+	 temp_instance,
+	 outputdir);
+			if (system(buf))
+			{
+fprintf(stderr, _("\n%s: cp failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
+exit(2);
+			}
+		}
+		else
+		{
+			/* run initdb */
+			header(_("initializing database system"));
+			snprintf(buf, sizeof(buf),
+	 

Re: slowest tap tests - split or accelerate?

2022-01-19 Thread Andres Freund
Hi,

On 2022-01-19 11:54:01 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
> >> Maybe we really should do at least the most simplistic caching for 
> >> initdbs, by
> >> doing one initdb as part of the creation of temp_install. Then 
> >> Cluster::init
> >> would need logic to only use that if $params{extra} is empty.
>
> > I hacked this together. And the wins are bigger than I thought.
>
> Me too ;-).  As I remarked earlier, I'd tried this once before and
> gave up because it didn't seem to be winning much.  But that was
> before we had so many initdb's triggered by TAP tests, I think.

What approach did you use? Do you have a better idea than generating
tmp_install/initdb_template?

I for a bit wondered whether initdb should do this internally instead. But it
seemed more work than I wanted to tackle.

The bit in the patch about generating initdb_template in Install.pm definitely
needs to be made conditional, but I don't precisely know on what. The
buildfarm just calls it as
  perl install.pl "$installdir


> So this looks like it'll be a nice win for low-end hardware, too.

Nice!


> (Note that all four runs have the "fsync = on" removed from
> 008_fsm_truncation.pl.)

I assume you're planning on comitting that?

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-19 Thread Tom Lane
Andres Freund  writes:
> On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
>> Maybe we really should do at least the most simplistic caching for initdbs, 
>> by
>> doing one initdb as part of the creation of temp_install. Then Cluster::init
>> would need logic to only use that if $params{extra} is empty.

> I hacked this together. And the wins are bigger than I thought.

Me too ;-).  As I remarked earlier, I'd tried this once before and
gave up because it didn't seem to be winning much.  But that was
before we had so many initdb's triggered by TAP tests, I think.

I tried this patch on florican's host, which seems mostly disk-bound
when doing check-world.  It barely gets any win from parallelism:

$ time make -s check-world -j1 >/dev/null
 3809.60 real   584.44 user   282.23 sys
$ time make -s check-world -j2 >/dev/null
 3789.90 real   610.60 user   289.60 sys

Adding v2-0001-hack-use-template-initdb-in-tap-tests.patch:

$ time make -s check-world -j1 >/dev/null
 3193.46 real   221.32 user   226.11 sys
$ time make -s check-world -j2 >/dev/null
 3211.19 real   224.31 user   230.07 sys

(Note that all four runs have the "fsync = on" removed from
008_fsm_truncation.pl.)

So this looks like it'll be a nice win for low-end hardware, too.

regards, tom lane




Re: slowest tap tests - split or accelerate?

2022-01-18 Thread Andres Freund
Hi,

On 2022-01-18 13:40:40 -0800, Andres Freund wrote:
> Maybe we really should do at least the most simplistic caching for initdbs, by
> doing one initdb as part of the creation of temp_install. Then Cluster::init
> would need logic to only use that if $params{extra} is empty.

I hacked this together. And the wins are bigger than I thought. On my
workstation, with plenty cpu and storage bandwidth, according to
  /usr/bin/time check-world NO_TEMP_INSTALL=1
things go from

  321.56user 74.00system 2:26.22elapsed 270%CPU (0avgtext+0avgdata 
93768maxresident)k
  24inputs+32781336outputs (2254major+8717121minor)pagefaults 0swaps

to

  86.62user 57.10system 1:57.83elapsed 121%CPU (0avgtext+0avgdata 
93752maxresident)k
  8inputs+32683408outputs (1360major+6672618minor)pagefaults 0swaps

The difference in elapsed and system time is pretty good, but the user time
difference is quite staggering.


This doesn't yet actually address the case of the basebackup tests, because
that specifies a "non-default" option, preventing the use of the template
initdb. But the effects are already big enough that I thought it's worth
sharing.

On CI for windows this reduces the time for the subscription tests from 03:24,
to 2:39. There's some run-to-run variation, but it's a pretty clear signal...

Greetings,

Andres Freund
>From a0818b6c8d03f152baa5df231b27aa7b8a7fde45 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 18 Jan 2022 15:25:27 -0800
Subject: [PATCH v2] hack: use "template" initdb in tap tests.

---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 31 +-
 src/test/regress/pg_regress.c| 45 ++--
 src/Makefile.global.in   | 52 +---
 src/tools/msvc/Install.pm|  4 ++
 src/tools/msvc/vcregress.pl  |  1 +
 5 files changed, 94 insertions(+), 39 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 7af0f8db139..7e235c90d8c 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -430,8 +430,35 @@ sub init
 	mkdir $self->backup_dir;
 	mkdir $self->archive_dir;
 
-	PostgreSQL::Test::Utils::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
-		@{ $params{extra} });
+	if (defined $params{extra} or !defined $ENV{INITDB_TEMPLATE})
+	{
+		diag("*not* using initdb template");
+		PostgreSQL::Test::Utils::system_or_bail('initdb', '-D', $pgdata, '-A', 'trust', '-N',
+			@{ $params{extra} });
+	}
+	else
+	{
+		my $old_umask;
+
+		diag("using initdb template");
+
+		if (!$PostgreSQL::Test::Utils::windows_os)
+		{
+			$old_umask = umask;
+			umask 0077;
+		}
+
+		PostgreSQL::Test::RecursiveCopy::copypath(
+			$ENV{INITDB_TEMPLATE},
+			$pgdata,
+		  );
+
+		if (!$PostgreSQL::Test::Utils::windows_os)
+		{
+			umask $old_umask;
+		}
+	}
+
 	PostgreSQL::Test::Utils::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata,
 		@{ $params{auth_extra} });
 
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index e6f71c7582e..07f1beae6c6 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2232,6 +2232,7 @@ regression_main(int argc, char *argv[],
 		FILE	   *pg_conf;
 		const char *env_wait;
 		int			wait_seconds;
+		const char* initdb_template_dir;
 
 		/*
 		 * Prepare the temp instance
@@ -2258,20 +2259,38 @@ regression_main(int argc, char *argv[],
 		if (!directory_exists(buf))
 			make_directory(buf);
 
-		/* initdb */
-		header(_("initializing database system"));
-		snprintf(buf, sizeof(buf),
- "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync%s%s > \"%s/log/initdb.log\" 2>&1",
- bindir ? bindir : "",
- bindir ? "/" : "",
- temp_instance,
- debug ? " --debug" : "",
- nolocale ? " --no-locale" : "",
- outputdir);
-		if (system(buf))
+		/* create data directory, either from a template, or by running initdb */
+		initdb_template_dir = getenv("INITDB_TEMPLATE");
+		if (initdb_template_dir == NULL)
 		{
-			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
-			exit(2);
+			header(_("initializing database system"));
+			snprintf(buf, sizeof(buf),
+	 "\"%s%sinitdb\" -D \"%s/data\" --no-clean --no-sync%s%s > \"%s/log/initdb.log\" 2>&1",
+	 bindir ? bindir : "",
+	 bindir ? "/" : "",
+	 temp_instance,
+	 debug ? " --debug" : "",
+	 nolocale ? " --no-locale" : "",
+	 outputdir);
+			if (system(buf))
+			{
+fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
+exit(2);
+			}
+		}
+		else
+		{
+			/* FIXME: this only works on windows when there's a GNU cp in PATH */
+			header(_("initializing database system from template"));
+			snprintf(buf, sizeof(buf),
+	 "cp -a \"%s\" \"%s/data\"",
+	 initdb_template_dir,
+	 temp_instance);
+			if 

Re: slowest tap tests - split or accelerate?

2022-01-18 Thread Andres Freund
Hi,

On 2022-01-18 12:49:16 -0500, Robert Haas wrote:
> Here's a patch that splits up that file.

Ah, nice! The split seems sensible to me.


> Sadly, we've gained about 2.5 seconds of runtime as the price of
> splitting the test. Arguably the options part could be split up a lot
> more finely than this, but that would drive up the runtime even more,
> basically because we'd need more initdbs. So I don't know whether it's
> better to leave things as they are, split them this much, or split
> them more. I think this amount of splitting might be justified simply
> in the interests of clarity, but I'm reluctant to go further unless we
> get some nifty initdb-caching system.

Hm. From the buildfarm / CF perspective it might still be a win, because the
different pieces can run concurrently. But it's not great :(.

Maybe we really should do at least the most simplistic caching for initdbs, by
doing one initdb as part of the creation of temp_install. Then Cluster::init
would need logic to only use that if $params{extra} is empty.

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-18 Thread Robert Haas
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund  wrote:
> > pg_basebackup's 010_pg_basebackup.pl looks like it could be split up,
> > though. That one, at least to me, looks like people have just kept
> > adding semi-related things into the same test file.
>
> Yea.

Here's a patch that splits up that file. Essentially the first half of
the file is concerned with testing that a backup ends up in the state
it expects, while the second half is concerned with checking that
various options to pg_basebackup work. So I split it that way, plus I
moved some of the really basic stuff to a completely separate file
with a very brief runtime. The test results are interesting.

Unpatched:

[12:33:33] t/010_pg_basebackup.pl ... ok16161 ms ( 0.02 usr  0.00
sys +  2.07 cusr  7.80 csys =  9.89 CPU)
[12:33:49] t/020_pg_receivewal.pl ... ok 4115 ms ( 0.00 usr  0.00
sys +  0.89 cusr  1.73 csys =  2.62 CPU)
[12:33:53] t/030_pg_recvlogical.pl .. ok 1857 ms ( 0.01 usr  0.01
sys +  0.63 cusr  0.73 csys =  1.38 CPU)
[12:33:55]
All tests successful.
Files=3, Tests=177, 22 wallclock secs ( 0.04 usr  0.02 sys +  3.59
cusr 10.26 csys = 13.91 CPU)

Pached:

[12:32:03] t/010_pg_basebackup_basic.pl .. ok  192 ms ( 0.01
usr  0.00 sys +  0.10 cusr  0.05 csys =  0.16 CPU)
[12:32:03] t/011_pg_basebackup_integrity.pl .. ok 5530 ms ( 0.00
usr  0.00 sys +  0.87 cusr  2.51 csys =  3.38 CPU)
[12:32:09] t/012_pg_basebackup_options.pl  ok13117 ms ( 0.00
usr  0.00 sys +  1.87 cusr  6.31 csys =  8.18 CPU)
[12:32:22] t/020_pg_receivewal.pl  ok 4314 ms ( 0.01
usr  0.00 sys +  0.97 cusr  1.77 csys =  2.75 CPU)
[12:32:26] t/030_pg_recvlogical.pl ... ok 1908 ms ( 0.00
usr  0.00 sys +  0.64 cusr  0.77 csys =  1.41 CPU)
[12:32:28]
All tests successful.
Files=5, Tests=177, 25 wallclock secs ( 0.04 usr  0.02 sys +  4.45
cusr 11.41 csys = 15.92 CPU)

Sadly, we've gained about 2.5 seconds of runtime as the price of
splitting the test. Arguably the options part could be split up a lot
more finely than this, but that would drive up the runtime even more,
basically because we'd need more initdbs. So I don't know whether it's
better to leave things as they are, split them this much, or split
them more. I think this amount of splitting might be justified simply
in the interests of clarity, but I'm reluctant to go further unless we
get some nifty initdb-caching system.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Split-010_pg_basebackup.pl.patch
Description: Binary data


Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Andres Freund
Hi,

On 2022-01-17 15:13:57 -0500, Robert Haas wrote:
> I guess there must be something explaining it, but I don't know what
> it could be. The client and the server are each running the checksum
> algorithm over the same data. If that's not the same speed then  I
> don't get it. Unless, somehow, they're using different implementations
> of it?

I think that actually might be the issue. On linux a test a pg_verifybackup
was much faster than on windows (as in 10x). But if I disable openssl, it's
only 2x.

On the windows instance I *do* have openssl enabled. But I suspect something
is off and the windows buildsystem ends up with our hand-rolled implementation
on the client side, but not the server side. Which'd explain the times I'm
seeing: We have a fast CRC implementation, but the rest is pretty darn
unoptimized.

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Tom Lane
Andres Freund  writes:
> I've occasionally pondered caching initdb results and reusing them across
> tests - just the locking around it seems a bit nasty, but perhaps that could
> be done as part of the tmp_install step. Of course, it'd need to deal with
> different options etc...

I'd actually built a prototype to do that, based on making a reference
cluster and then "cp -a"'ing it instead of re-running initdb.  I gave
up when I found than on slower, disk-bound machines it was hardly
any faster.  Thinking about it now, I wonder why not just re-use one
cluster for many tests, only dropping and re-creating the database
in which the testing happens.

regards, tom lane




Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Robert Haas
On Mon, Jan 17, 2022 at 2:57 PM Andres Freund  wrote:
> I wonder if there's something explaining why pg_verifybackup is greatly slowed
> down by sha224 but not crc32c, but the server's runtime only differs by ~20ms?
> It seems incongruous that pg_basebackup, with all the complexity of needing to
> communicate with the server, transferring the backup over network, and *also*
> computing checksums, takes as long as the pg_verifybackup invocation?

I guess there must be something explaining it, but I don't know what
it could be. The client and the server are each running the checksum
algorithm over the same data. If that's not the same speed then  I
don't get it. Unless, somehow, they're using different implementations
of it?

> I've occasionally pondered caching initdb results and reusing them across
> tests - just the locking around it seems a bit nasty, but perhaps that could
> be done as part of the tmp_install step. Of course, it'd need to deal with
> different options etc...

It's a thought, but it does seem like a bit of a pain to implement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Andres Freund
Hi,

On 2022-01-17 14:05:17 -0500, Robert Haas wrote:
> On Mon, Jan 17, 2022 at 1:41 PM Andres Freund  wrote:
> > The reason these in particular are slow is that they do a lot of
> > pg_basebackups without either / one-of -cfast / --no-sync. The lack of 
> > -cfast
> > in particularly is responsible for a significant proportion of the test
> > time. The only reason this didn't cause the tests to take many minutes is 
> > that
> > spread checkpoints only throttle when writing out a buffer and there aren't
> > that many dirty buffers...
>
> Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not
> sure what else can realistically be done to speed it up without losing
> the point of the test. And it's basically just a single loop, so
> splitting it up doesn't seem to make a lot of sense either.

It's also not that slow compared other tests after the -cfast addition.

However, I'm a bit surprised at how long the individual pg_verifybackup calls
take on windows - about as long as the pg_basebackup call itself.

 Running: pg_basebackup -D 
C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224
 --manifest-checksums sha224 --no-sync -cfast
# timing: [4.798 + 0.704s]: complete
# Running: pg_verifybackup -e 
C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/sha224
backup successfully verified
# timing: [5.507 + 0.697s]: completed


Interestingly, with crc32c, this is not so:

# Running: pg_basebackup -D 
C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c
 --manifest-checksums crc32c --no-sync -cfast
# timing: [3.500 + 0.688s]: completed
ok 5 - backup ok with algorithm "crc32c"
ok 6 - crc32c is mentioned many times in the manifest
# Running: pg_verifybackup -e 
C:/dev/postgres/.\src\bin\pg_verifybackup\/tmp_check/t_002_algorithm_primary_data/backup/crc32c
backup successfully verified
# timing: [4.194 + 0.197s]: completed


I wonder if there's something explaining why pg_verifybackup is greatly slowed
down by sha224 but not crc32c, but the server's runtime only differs by ~20ms?
It seems incongruous that pg_basebackup, with all the complexity of needing to
communicate with the server, transferring the backup over network, and *also*
computing checksums, takes as long as the pg_verifybackup invocation?


> pg_basebackup's 010_pg_basebackup.pl looks like it could be split up,
> though. That one, at least to me, looks like people have just kept
> adding semi-related things into the same test file.


Yea.


It's generally interesting how much time initdb takes in these tests. It's
about 1.1s on my linux workstation, and 2.1s on windows.

I've occasionally pondered caching initdb results and reusing them across
tests - just the locking around it seems a bit nasty, but perhaps that could
be done as part of the tmp_install step. Of course, it'd need to deal with
different options etc...

Greetings,

Andres Freund




Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Robert Haas
On Mon, Jan 17, 2022 at 1:41 PM Andres Freund  wrote:
> The reason these in particular are slow is that they do a lot of
> pg_basebackups without either / one-of -cfast / --no-sync. The lack of -cfast
> in particularly is responsible for a significant proportion of the test
> time. The only reason this didn't cause the tests to take many minutes is that
> spread checkpoints only throttle when writing out a buffer and there aren't
> that many dirty buffers...

Adding -cfast to 002_algorithm.pl seems totally reasonable. I'm not
sure what else can realistically be done to speed it up without losing
the point of the test. And it's basically just a single loop, so
splitting it up doesn't seem to make a lot of sense either.

pg_basebackup's 010_pg_basebackup.pl looks like it could be split up,
though. That one, at least to me, looks like people have just kept
adding semi-related things into the same test file.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: slowest tap tests - split or accelerate?

2022-01-17 Thread Andres Freund
Hi,

On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> cfbot now runs most tests on windows, the windows task is by far the slowest,
> and the task limitted most in concurrency [2]. Running tap tests is the
> biggest part of that. This is a bigger issue on windows because we don't have
> infrastructure (yet) to run tests in parallel.
> 
> There's a few tests which stand out in their slowness, which seem worth
> addressing even if we tackle test parallelism on windows at some point. I
> often find them to be the slowest tests on linux too.
> 
> Picking a random successful cfbot run [1] I see the following tap tests taking
> more than 20 seconds:
> 
> 67188 ms pg_basebackup t/010_pg_basebackup.pl
> 25751 ms pg_verifybackup t/002_algorithm.pl

The reason these in particular are slow is that they do a lot of
pg_basebackups without either / one-of -cfast / --no-sync. The lack of -cfast
in particularly is responsible for a significant proportion of the test
time. The only reason this didn't cause the tests to take many minutes is that
spread checkpoints only throttle when writing out a buffer and there aren't
that many dirty buffers...

Attached is a patch changing the parameters in all the instances I
found. Testing on a local instance it about halves the runtime of
t/010_pg_basebackup.pl on linux and windows (but there's still a 2x time
difference between the two), it's less when running the tests concurrently CI.

It might be worth having one explicit use of -cspread. Perhaps combined with
an explicit checkpoint beforehand?

Greetings,

Andres Freund
>From 4abed85e9085786f82d87667f2451821c01d37c0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 17 Jan 2022 00:51:43 -0800
Subject: [PATCH v1 1/4] tests: Consistently use pg_basebackup -cfast --no-sync
 to accelerate tests.

Most tests invoking pg_basebackup themselves did not yet use -cfast, which
makes pg_basebackup take considerably longer. The only reason this didn't
cause the tests to take many minutes is that spread checkpoints only throttle
when writing out a buffer and there aren't that many dirty buffers...

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 104 ++-
 src/bin/pg_verifybackup/t/002_algorithm.pl   |   2 +-
 src/bin/pg_verifybackup/t/003_corruption.pl  |   2 +-
 src/bin/pg_verifybackup/t/004_options.pl |   2 +-
 src/bin/pg_verifybackup/t/006_encoding.pl|   2 +-
 src/bin/pg_verifybackup/t/007_wal.pl |   4 +-
 6 files changed, 61 insertions(+), 55 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 872fec3d350..ebd102cd098 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -20,6 +20,13 @@ my $tempdir = PostgreSQL::Test::Utils::tempdir;
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 
+# For nearly all tests some options should be specified, to keep test times
+# reasonable. Using @pg_basebackup_defs as the first element of the array
+# passed to to IPC::Run interpolates the array (as it is not a reference to an
+# array)..
+my @pg_basebackup_defs = ('pg_basebackup', '--no-sync', '-cfast');
+
+
 # Set umask so test directories and files are created with default permissions
 umask(0077);
 
@@ -43,7 +50,7 @@ $node->set_replication_conf();
 system_or_bail 'pg_ctl', '-D', $pgdata, 'reload';
 
 $node->command_fails(
-	[ 'pg_basebackup', '-D', "$tempdir/backup" ],
+	[ @pg_basebackup_defs, '-D', "$tempdir/backup" ],
 	'pg_basebackup fails because of WAL configuration');
 
 ok(!-d "$tempdir/backup", 'backup directory was cleaned up');
@@ -54,7 +61,7 @@ mkdir("$tempdir/backup")
   or BAIL_OUT("unable to create $tempdir/backup");
 append_to_file("$tempdir/backup/dir-not-empty.txt", "Some data");
 
-$node->command_fails([ 'pg_basebackup', '-D', "$tempdir/backup", '-n' ],
+$node->command_fails([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-n' ],
 	'failing run with no-clean option');
 
 ok(-d "$tempdir/backup", 'backup directory was created and left behind');
@@ -105,7 +112,7 @@ foreach my $filename (@tempRelationFiles)
 }
 
 # Run base backup.
-$node->command_ok([ 'pg_basebackup', '-D', "$tempdir/backup", '-X', 'none' ],
+$node->command_ok([ @pg_basebackup_defs, '-D', "$tempdir/backup", '-X', 'none' ],
 	'pg_basebackup runs');
 ok(-f "$tempdir/backup/PG_VERSION",  'backup was created');
 ok(-f "$tempdir/backup/backup_manifest", 'backup manifest included');
@@ -165,9 +172,9 @@ rmtree("$tempdir/backup");
 
 $node->command_ok(
 	[
-		'pg_basebackup','-D',
-		"$tempdir/backup2", '--no-manifest',
-		'--waldir', "$tempdir/xlog2"
+		@pg_basebackup_defs, '-D',
+		"$tempdir/backup2",  '--no-manifest',
+		'--waldir',  "$tempdir/xlog2"
 	],
 	'separate xlog directory');
 ok(-f "$tempdir/backup2/PG_VERSION",   'backup was created');
@@ -176,31 +183,31 @@ ok(-d "$tempdir/xlog2/", 

slowest tap tests - split or accelerate?

2021-12-31 Thread Andres Freund
Hi,

cfbot now runs most tests on windows, the windows task is by far the slowest,
and the task limitted most in concurrency [2]. Running tap tests is the
biggest part of that. This is a bigger issue on windows because we don't have
infrastructure (yet) to run tests in parallel.

There's a few tests which stand out in their slowness, which seem worth
addressing even if we tackle test parallelism on windows at some point. I
often find them to be the slowest tests on linux too.

Picking a random successful cfbot run [1] I see the following tap tests taking
more than 20 seconds:

67188 ms pg_basebackup t/010_pg_basebackup.pl
59710 ms recovery t/001_stream_rep.pl
57542 ms pg_rewind t/001_basic.pl
56179 ms subscription t/001_rep_changes.pl
42146 ms pgbench t/001_pgbench_with_server.pl
38264 ms recovery t/018_wal_optimize.pl
33642 ms subscription t/013_partition.pl
29129 ms pg_dump t/002_pg_dump.pl
25751 ms pg_verifybackup t/002_algorithm.pl
20628 ms subscription t/011_generated.pl

It would be good if we could make those tests faster, or if not easily
possible, at least split those tests into smaller tap tests.

Splitting a longer test into smaller ones is preferrable even if they take the
same time, because we can use prove level concurrency on windows to gain some
test parallelism. As a nice side-effect it makes it also quicker to run a
split test isolated during development.

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/5207126145499136
[2] https://cirrus-ci.org/faq/#are-there-any-limits