Re: [HACKERS] Further stabilization of isolationtester's timeouts test

2016-05-30 Thread Mikael Kjellström



On 2016-05-27 04:06, Tom Lane wrote:


In this case the process seems to have gone to sleep for just short of a
minute rather than the expected 5 seconds.  Presumably that just reflects
overload on the buildfarm member rather than anything really exciting,
but it explains the failure nicely: by the time we got to postgres.c's
ProcessInterrupts(), both the lock and statement timeouts had expired,
and the code there preferentially reports "lock timeout" in that case.


Just wanted to chip in and say that it's almost certain due to 
overloading.  It's a virtual server (VMWare) that runs 4 build animals 
and they all where scheduled to run at the same time and it's on 
spinning rust (i.e. HDD) so it will get overloaded easilly.


I've changed the schedules of the 4 animals so that they shouldn't 
overloap so from now on it should hopefully be much better.


/Mikael


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Further stabilization of isolationtester's timeouts test

2016-05-26 Thread Tom Lane
I've been noticing recently that certain buildfarm members sometimes fail
the "timeouts" isolation test with this symptom:

*** 
/home/pgbf/buildroot/HEAD/pgsql.build/src/test/isolation/expected/timeouts.out  
Mon May 16 23:45:12 2016
--- 
/home/pgbf/buildroot/HEAD/pgsql.build/src/test/isolation/results/timeouts.out   
Mon May 16 23:53:08 2016
***
*** 70,73 
  step slto: SET lock_timeout = 6000; SET statement_timeout = 5000;
  step update: DELETE FROM accounts WHERE accountid = 'checking'; 
  step update: <... completed>
! ERROR:  canceling statement due to statement timeout
--- 70,73 
  step slto: SET lock_timeout = 6000; SET statement_timeout = 5000;
  step update: DELETE FROM accounts WHERE accountid = 'checking'; 
  step update: <... completed>
! ERROR:  canceling statement due to lock timeout

as for example here and here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder=2016-05-16%2021%3A45%3A06
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder=2016-05-26%2016%3A45%3A03

A look at the postmaster log is informative:

2016-05-16 23:52:12.612 CEST [573a40e9.24a8:26] LOG:  statement: BEGIN 
ISOLATION LEVEL READ COMMITTED;
2016-05-16 23:52:12.612 CEST [573a40e9.10ef:50] LOG:  statement: BEGIN 
ISOLATION LEVEL READ COMMITTED;
2016-05-16 23:52:12.613 CEST [573a40e9.24a8:27] LOG:  statement: UPDATE 
accounts SET balance = balance + 100;
2016-05-16 23:52:12.613 CEST [573a40e9.10ef:51] LOG:  statement: SET 
lock_timeout = 6000; SET statement_timeout = 5000;
2016-05-16 23:52:12.614 CEST [573a40e9.10ef:52] LOG:  statement: DELETE FROM 
accounts WHERE accountid = 'checking';
2016-05-16 23:52:12.631 CEST [573a40e9.5afe:34] LOG:  execute 
isolationtester_waiting: SELECT pg_catalog.pg_blocking_pids($1) && 
'{9384,4335}'::integer[]
2016-05-16 23:52:12.631 CEST [573a40e9.5afe:35] DETAIL:  parameters: $1 = '4335'
2016-05-16 23:53:08.658 CEST [573a40e9.10ef:53] ERROR:  canceling statement due 
to lock timeout
2016-05-16 23:53:08.658 CEST [573a40e9.10ef:54] CONTEXT:  while deleting tuple 
(0,1) in relation "accounts"
2016-05-16 23:53:08.658 CEST [573a40e9.10ef:55] STATEMENT:  DELETE FROM 
accounts WHERE accountid = 'checking';
2016-05-16 23:53:08.659 CEST [573a40e9.24a8:28] LOG:  statement: ABORT;
2016-05-16 23:53:08.659 CEST [573a40e9.10ef:56] LOG:  statement: ABORT;

In this case the process seems to have gone to sleep for just short of a
minute rather than the expected 5 seconds.  Presumably that just reflects
overload on the buildfarm member rather than anything really exciting,
but it explains the failure nicely: by the time we got to postgres.c's
ProcessInterrupts(), both the lock and statement timeouts had expired,
and the code there preferentially reports "lock timeout" in that case.

What I propose we do about this is make ProcessInterrupts() compare
the scheduled timeout times if both timeouts have triggered, and report
whichever event was intended to happen first.  We should still apply
the "lock error first" tiebreaking rule if there's an exact tie, but
that's likely to be rare.

The attached patch does that, and also reduces the longer of the two
timeouts in these test cases to 5010 ms.  That should give us good
coverage in the buildfarm of both the case where both timeouts have
fired before reaching ProcessInterrupts(), and the case where they
have not.  We should get bulletproof reproducibility of which timeout
is reported, as long as the system clock doesn't go backwards during
the test.

Unfortunately this doesn't do anything to let us reduce the base 5000-ms
timeout :-(.  That's driven by the need to be sure that isolationtester
has detected that the processes are waiting, which is independent of this
problem.

I'd like to put this into all branches back to 9.3, since we've seen
this type of failure in all of them.  Any objections?

regards, tom lane

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 68811f1..b185c1b 100644
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
*** ProcessInterrupts(void)
*** 2909,2914 
--- 2909,2917 
  
  	if (QueryCancelPending)
  	{
+ 		bool		lock_timeout_occurred;
+ 		bool		stmt_timeout_occurred;
+ 
  		/*
  		 * Don't allow query cancel interrupts while reading input from the
  		 * client, because we might lose sync in the FE/BE protocol.  (Die
*** ProcessInterrupts(void)
*** 2929,2945 
  
  		/*
  		 * If LOCK_TIMEOUT and STATEMENT_TIMEOUT indicators are both set, we
! 		 * prefer to report the former; but be sure to clear both.
  		 */
! 		if (get_timeout_indicator(LOCK_TIMEOUT, true))
  		{
- 			(void) get_timeout_indicator(STATEMENT_TIMEOUT, true);
  			LockErrorCleanup();
  			ereport(ERROR,
  	(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
  	 errmsg("canceling statement due to lock timeout")));
  		}
! 		if (get_timeout_indicator(STATEMENT_TIMEOUT, true))
  		{
  			LockErrorCleanup();