Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Jan de Visser
On April 22, 2015 11:14:08 AM Heikki Linnakangas wrote:
 On 04/16/2015 06:51 AM, Alvaro Herrera wrote:
  Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
  thought the -w would wait until promotion has taken effect, so there's
  no need to sleep additional time.
 
 -w is not supported with pg_ctl promote. Only start, stop and restart.
 It's accepted, but it doesn't do anything. Which isn't very nice...

I'm futzing with the pg_ctl cmd line parameters for my patch notifying pg_ctl 
that restart didn't work. I'll fix this as well.

 
 - Heikki

jan



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/16/2015 06:51 AM, Alvaro Herrera wrote:

Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
thought the -w would wait until promotion has taken effect, so there's
no need to sleep additional time.


-w is not supported with pg_ctl promote. Only start, stop and restart. 
It's accepted, but it doesn't do anything. Which isn't very nice...


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Heikki Linnakangas

On 04/20/2015 05:21 AM, Michael Paquier wrote:

I have just made a run of the TAP tests of pg_rewind on my raspberry
PI 1 (hamster), where the tests are very slow, and I noticed that it
takes up to 10s to get confirmation from standby that it has caught up
with the changes from master, and 5s to get confirmation that standby
has been promoted. So the tests in HEAD would be broken without this
patch, and 30s gives visibly enough room.


Ok, applied, with some refactoring: I put the sleep-retry loop into a 
separate function.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-22 Thread Michael Paquier
On Wed, Apr 22, 2015 at 8:40 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 On 04/20/2015 05:21 AM, Michael Paquier wrote:

 I have just made a run of the TAP tests of pg_rewind on my raspberry
 PI 1 (hamster), where the tests are very slow, and I noticed that it
 takes up to 10s to get confirmation from standby that it has caught up
 with the changes from master, and 5s to get confirmation that standby
 has been promoted. So the tests in HEAD would be broken without this
 patch, and 30s gives visibly enough room.


 Ok, applied, with some refactoring: I put the sleep-retry loop into a
 separate function.

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-19 Thread Michael Paquier
On Thu, Apr 16, 2015 at 1:00 PM, Michael Paquier wrote:
 Visibly that's not the case for this test case, the timing issues that
 we saw happened not because of the standby not catching up, but
 because of the promotion not taking effect in a timely fashion. And
 that's as well something I saw on my OSX box some days ago as well,
 explaining why the sleep time has been increased from 1 to 2 in
 53ba1077. This patch just takes it the careful way. In perl, system
 waits for the process of pg_ctl to exit before moving on, perhaps the
 problem is that pg_ctl thinks that promotion is done, but the node is
 not quite ready, explaining why the test fails.

I have just made a run of the TAP tests of pg_rewind on my raspberry
PI 1 (hamster), where the tests are very slow, and I noticed that it
takes up to 10s to get confirmation from standby that it has caught up
with the changes from master, and 5s to get confirmation that standby
has been promoted. So the tests in HEAD would be broken without this
patch, and 30s gives visibly enough room.

Note as well that I would like to enable the TAP tests on this
buildfarm machine btw. It may be good to get some input where things
are really slow even for other tests.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Alvaro Herrera
Michael Paquier wrote:

 However after discussion with a colleague we have noticed that those
 values may not be enough in slow environments, a value of up to 10s
 being sometimes needed after promotion to make tests pass.

Yeah, hardcoded sleep times are not reliable.  (/me would love to get
rid of hardcoded times in isolationtester timeout.spec test ...)

 Attached is a patch improving this sleep logic and doing the following things:
 1) To ensure that standby has caught up, check replay position on the
 standby and compare it with the current WAL position of master.

 2) To ensure that promotion is effective, use pg_is_in_recovery() and
 continue processing until we are sure that the standby is out of
 recovery.

Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
thought the -w would wait until promotion has taken effect, so there's
no need to sleep additional time.

 Note that this patch adds a small routine called command_result in
 TestLib.pm, able to return stdout, stderr and the exit code. That's
 really handy, and I am planning to use something like that as well in
 the replication test suite.

Makes sense to me.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Michael Paquier
Hi all,

TAP tests of pg_rewind are using in 2 places hardcoded values to wait
for a given amount of time for some events. In HEAD, those things are:
1) Wait for 1s for standby to catch up.
2) Wait for 2s for promotion of standby.
However after discussion with a colleague we have noticed that those
values may not be enough in slow environments, a value of up to 10s
being sometimes needed after promotion to make tests pass. And
actually the current way of doing is not reliable because it depends
on how the environment is able to handle quickly standby replay and
promotion (I am expecting issues regarding that on Windows btw, and on
small spec machines).

Attached is a patch improving this sleep logic and doing the following things:
1) To ensure that standby has caught up, check replay position on the
standby and compare it with the current WAL position of master.
2) To ensure that promotion is effective, use pg_is_in_recovery() and
continue processing until we are sure that the standby is out of
recovery.
In each case the patch attached makes a maximum of 30 attempts, each
attempt waiting 1s before processing, so we let a margin of up to 30s
for environments to handle replay and promotion properly.

Note that this patch adds a small routine called command_result in
TestLib.pm, able to return stdout, stderr and the exit code. That's
really handy, and I am planning to use something like that as well in
the replication test suite.
Regards,
-- 
Michael
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index e6a5b9b..86e9def 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -183,7 +183,7 @@ sub create_standby
 	# Base backup is taken with xlog files included
 	system_or_bail(pg_basebackup -D $test_standby_datadir -p $port_master -x $log_path 21);
 	append_to_file($test_standby_datadir/recovery.conf, qq(
-primary_conninfo='$connstr_master'
+primary_conninfo='$connstr_master application_name=rewind_standby'
 standby_mode=on
 recovery_target_timeline='latest'
 ));
@@ -191,8 +191,29 @@ recovery_target_timeline='latest'
 	# Start standby
 	system_or_bail(pg_ctl -w -D $test_standby_datadir -o \-k $tempdir_short --listen-addresses='' -p $port_standby\ start $log_path 21);
 
-	# sleep a bit to make sure the standby has caught up.
-	sleep 1;
+	# Wait until the standby has caught up with the primary by comparing
+	# WAL positions on both nodes. Note that this is fine
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts  $max_attempts)
+	{
+		# Wait a bit before proceeding.
+		sleep 1;
+		$attempts++;
+
+		my $query = SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';;
+		my $cmd = ['psql', '-At', '-c', $query, '-d', $connstr_master ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq t)
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die Maximum number of attempts reached when waiting for standby to catch up;
+	}
 }
 
 sub promote_standby
@@ -201,9 +222,31 @@ sub promote_standby
 	# up standby
 
 	# Now promote slave and insert some new data on master, this will put
-	# the master out-of-sync with the standby.
+	# the master out-of-sync with the standby. Be sure that we leave here
+	# with a standby actually ready for the next operations.
 	system_or_bail(pg_ctl -w -D $test_standby_datadir promote $log_path 21);
-	sleep 2;
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts  $max_attempts)
+	{
+		# Wait a bit before proceeding, promotion may have not taken effect
+		# in such a short time.
+		sleep 1;
+		$attempts++;
+
+		my $query = SELECT pg_is_in_recovery();
+		my $cmd = ['psql', '-At', '-c', $query, '-d', $connstr_standby ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq f)
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die Maximum number of attempts reached when waiting for promotion of standby;
+	}
 }
 
 sub run_pg_rewind
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 003cd9a..3c6a49e 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -16,6 +16,7 @@ our @EXPORT = qw(
   command_ok
   command_fails
   command_exit_is
+  command_result
   program_help_ok
   program_version_ok
   program_options_handling_ok
@@ -161,6 +162,14 @@ sub command_exit_is
 	is($h-result(0), $expected, $test_name);
 }
 
+sub command_result
+{
+	my ($cmd) = @_;
+	my ($stdout, $stderr);
+	my $result = run $cmd, '', \$stdout, '2', \$stderr;
+	return ($result, $stdout, $stderr);
+}
+
 sub program_help_ok
 {
 	my ($cmd) = @_;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Michael Paquier
On Thu, Apr 16, 2015 at 12:51 PM, Alvaro Herrera wrote:
 Michael Paquier wrote:

 However after discussion with a colleague we have noticed that those
 values may not be enough in slow environments, a value of up to 10s
 being sometimes needed after promotion to make tests pass.

 Yeah, hardcoded sleep times are not reliable.  (/me would love to get
 rid of hardcoded times in isolationtester timeout.spec test ...)

 Attached is a patch improving this sleep logic and doing the following 
 things:
 1) To ensure that standby has caught up, check replay position on the
 standby and compare it with the current WAL position of master.

 2) To ensure that promotion is effective, use pg_is_in_recovery() and
 continue processing until we are sure that the standby is out of
 recovery.

 Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
 thought the -w would wait until promotion has taken effect, so there's
 no need to sleep additional time.

Visibly that's not the case for this test case, the timing issues that
we saw happened not because of the standby not catching up, but
because of the promotion not taking effect in a timely fashion. And
that's as well something I saw on my OSX box some days ago as well,
explaining why the sleep time has been increased from 1 to 2 in
53ba1077. This patch just takes it the careful way. In perl, system
waits for the process of pg_ctl to exit before moving on, perhaps the
problem is that pg_ctl thinks that promotion is done, but the node is
not quite ready, explaining why the test fails.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers