Re: [HACKERS] What in the world is happening with castoroides and protosciurus?

2014-09-01 Thread Dave Page
On Sat, Aug 30, 2014 at 11:32 PM, Noah Misch n...@leadboat.com wrote:
 On Tue, Aug 26, 2014 at 10:17:05AM +0100, Dave Page wrote:
 On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  For the last month or so, these two buildfarm animals (which I believe are
  the same physical machine) have been erratically failing with errors that
  reflect low-order differences in floating-point calculations.
 
  A recent example is at
 
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52
 
  where the only regression diff is
 
  *** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out  Mon Aug 25 11:41:00 2014
  --- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out  Mon Aug 25 11:57:26 2014
  ***************
  *** 171,179 ****
    SELECT h.seqno AS i8096, h.random AS f1234_1234
       FROM hash_f8_heap h
       WHERE h.random = '-1234.1234'::float8;
  !  i8096 | f1234_1234
  ! -------+------------
  !   8906 | -1234.1234
    (1 row)

    UPDATE hash_f8_heap
  --- 171,179 ----
    SELECT h.seqno AS i8096, h.random AS f1234_1234
       FROM hash_f8_heap h
       WHERE h.random = '-1234.1234'::float8;
  !  i8096 |    f1234_1234
  ! -------+-------------------
  !   8906 | -1234.12356777216
    (1 row)

    UPDATE hash_f8_heap
 
  ... a result that certainly makes no sense.  The results are not
  repeatable, failing in equally odd ways in different tests on different
  runs.  This is happening in all the back branches too, not just HEAD.

 I have
 no idea what is causing the current issue - the machine is stable
 software-wise, and only has private builds of dependency libraries
 updated periodically (which are not used for the buildfarm). If I had
 to hazard a guess, I'd suggest this is an early symptom of an old
 machine which is starting to give up.

 Agreed.  Rerunning each animal against older commits would test that theory.
 Say, run against the last 6 months of REL9_0_STABLE commits.  If those runs
 show today's failure frequencies instead of historic failure frequencies, it's
 not a PostgreSQL regression.  Not that I see a commit back-patched near the
 time of the failure uptick (2014-08-06) that looks remotely likely to have
 introduced such a regression.

 It would be sad to lose our only buildfarm coverage of plain Solaris and of
 the Sun Studio compiler, but having buildfarm members this unstable is a pain.
 Perhaps have those animals retry the unreliable steps up to, say, 7 times?

That would require changes to the buildfarm client. I'll see if I can
find some alternate resources we can use.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What in the world is happening with castoroides and protosciurus?

2014-08-30 Thread Noah Misch
On Tue, Aug 26, 2014 at 10:17:05AM +0100, Dave Page wrote:
 On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  For the last month or so, these two buildfarm animals (which I believe are
  the same physical machine) have been erratically failing with errors that
  reflect low-order differences in floating-point calculations.
 
  A recent example is at
 
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52
 
  where the only regression diff is
 
  *** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out  Mon Aug 25 11:41:00 2014
  --- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out  Mon Aug 25 11:57:26 2014
  ***************
  *** 171,179 ****
    SELECT h.seqno AS i8096, h.random AS f1234_1234
       FROM hash_f8_heap h
       WHERE h.random = '-1234.1234'::float8;
  !  i8096 | f1234_1234
  ! -------+------------
  !   8906 | -1234.1234
    (1 row)

    UPDATE hash_f8_heap
  --- 171,179 ----
    SELECT h.seqno AS i8096, h.random AS f1234_1234
       FROM hash_f8_heap h
       WHERE h.random = '-1234.1234'::float8;
  !  i8096 |    f1234_1234
  ! -------+-------------------
  !   8906 | -1234.12356777216
    (1 row)

    UPDATE hash_f8_heap
 
  ... a result that certainly makes no sense.  The results are not
  repeatable, failing in equally odd ways in different tests on different
  runs.  This is happening in all the back branches too, not just HEAD.

 I have
 no idea what is causing the current issue - the machine is stable
 software-wise, and only has private builds of dependency libraries
 updated periodically (which are not used for the buildfarm). If I had
 to hazard a guess, I'd suggest this is an early symptom of an old
 machine which is starting to give up.

Agreed.  Rerunning each animal against older commits would test that theory.
Say, run against the last 6 months of REL9_0_STABLE commits.  If those runs
show today's failure frequencies instead of historic failure frequencies, it's
not a PostgreSQL regression.  Not that I see a commit back-patched near the
time of the failure uptick (2014-08-06) that looks remotely likely to have
introduced such a regression.
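
The comparison proposed above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not part of the buildfarm tooling: the function names and the sample outcome lists are hypothetical, and the "closer to which rate" rule is a deliberately crude stand-in for a proper statistical test.

```python
from typing import Sequence


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that failed; each entry is True for a passing run."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def points_to_machine(historic: float, rerun: float, current: float) -> bool:
    """True if reruns of old commits fail at a rate closer to today's rate
    than to the historic rate, i.e. the code is probably not the cause."""
    return abs(rerun - current) < abs(rerun - historic)


# Hypothetical data: True = run passed, False = run failed.
historic_runs = [True] * 50                 # old commits, when first built
current_runs = [True] * 15 + [False] * 5    # recent commits, built today
rerun_old = [True] * 16 + [False] * 4       # old commits, rebuilt today

print(points_to_machine(failure_rate(historic_runs),
                        failure_rate(rerun_old),
                        failure_rate(current_runs)))
```

With only a handful of reruns the rates are noisy, so a real check would want enough runs per branch to tell the two frequencies apart.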

It would be sad to lose our only buildfarm coverage of plain Solaris and of
the Sun Studio compiler, but having buildfarm members this unstable is a pain.
Perhaps have those animals retry the unreliable steps up to, say, 7 times?
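
A retry wrapper of the kind suggested above might look like the following Python sketch. It is an assumption-laden illustration only: the real buildfarm client is a separate program and has no such option as far as this thread states, and `retry_step` and the flaky stand-in step are hypothetical names.

```python
from typing import Callable, Tuple


def retry_step(step: Callable[[], bool], max_attempts: int = 7) -> Tuple[bool, int]:
    """Call step() until it returns True or max_attempts calls have been made.

    Returns (succeeded, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if step():
            return True, attempt
    return False, max_attempts


def make_flaky_step(failures_before_success: int) -> Callable[[], bool]:
    """Build a stand-in step that fails a fixed number of times, then passes."""
    remaining = [failures_before_success]

    def step() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return step


print(retry_step(make_flaky_step(2)))   # passes on the third attempt
```

The trade-off is that retries also hide genuine intermittent bugs in the tested code, so the number of attempts used is returned and should be reported alongside the result.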




Re: [HACKERS] What in the world is happening with castoroides and protosciurus?

2014-08-26 Thread Dave Page
On Tue, Aug 26, 2014 at 1:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 For the last month or so, these two buildfarm animals (which I believe are
 the same physical machine) have been erratically failing with errors that
 reflect low-order differences in floating-point calculations.

 A recent example is at

 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52

 where the only regression diff is

 *** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out  Mon Aug 25 11:41:00 2014
 --- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out  Mon Aug 25 11:57:26 2014
 ***************
 *** 171,179 ****
   SELECT h.seqno AS i8096, h.random AS f1234_1234
      FROM hash_f8_heap h
      WHERE h.random = '-1234.1234'::float8;
 !  i8096 | f1234_1234
 ! -------+------------
 !   8906 | -1234.1234
   (1 row)

   UPDATE hash_f8_heap
 --- 171,179 ----
   SELECT h.seqno AS i8096, h.random AS f1234_1234
      FROM hash_f8_heap h
      WHERE h.random = '-1234.1234'::float8;
 !  i8096 |    f1234_1234
 ! -------+-------------------
 !   8906 | -1234.12356777216
   (1 row)

   UPDATE hash_f8_heap

 ... a result that certainly makes no sense.  The results are not
 repeatable, failing in equally odd ways in different tests on different
 runs.  This is happening in all the back branches too, not just HEAD.

 Has there been a system software update on this machine a month or so ago?
 If not, it's hard to think anything except that the floating point
 hardware on this box has developed problems.

There hasn't been a software update, but something happened about two
months ago, and we couldn't get to the bottom of exactly what it was -
essentially, castoroides started failing with "C compiler cannot
create executables". It appeared that the compiler was missing from
the path, however the config hadn't changed. Our working theory is
that there was previously a symlink to the compiler in one of the
directories in the path, that somehow got removed. The issue was fixed
by adding the actual compiler location to the path.

However, that would have only affected castoroides, and not
protosciurus which runs under a different environment config. I have
no idea what is causing the current issue - the machine is stable
software-wise, and only has private builds of dependency libraries
updated periodically (which are not used for the buildfarm). If I had
to hazard a guess, I'd suggest this is an early symptom of an old
machine which is starting to give up.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] What in the world is happening with castoroides and protosciurus?

2014-08-25 Thread Tom Lane
For the last month or so, these two buildfarm animals (which I believe are
the same physical machine) have been erratically failing with errors that
reflect low-order differences in floating-point calculations.

A recent example is at

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=protosciurus&dt=2014-08-25%2010%3A39%3A52

where the only regression diff is

*** /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/expected/hash_index.out  Mon Aug 25 11:41:00 2014
--- /export/home/dpage/pgbuildfarm/protosciurus/HEAD/pgsql.22860/src/test/regress/results/hash_index.out  Mon Aug 25 11:57:26 2014
***************
*** 171,179 ****
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 | f1234_1234
! -------+------------
!   8906 | -1234.1234
  (1 row)

  UPDATE hash_f8_heap
--- 171,179 ----
  SELECT h.seqno AS i8096, h.random AS f1234_1234
     FROM hash_f8_heap h
     WHERE h.random = '-1234.1234'::float8;
!  i8096 |    f1234_1234
! -------+-------------------
!   8906 | -1234.12356777216
  (1 row)

  UPDATE hash_f8_heap

... a result that certainly makes no sense.  The results are not
repeatable, failing in equally odd ways in different tests on different
runs.  This is happening in all the back branches too, not just HEAD.
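
One way to see how far apart the expected and observed values in the diff above really are is to compare their IEEE 754 bit patterns. The Python sketch below assumes float8 maps to a binary64 double (as it does on common platforms); the helper names are hypothetical and nothing here is taken from the PostgreSQL sources.

```python
import struct


def double_bits(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def differing_bits(a: float, b: float) -> int:
    """Count the bit positions at which the two doubles differ."""
    return bin(double_bits(a) ^ double_bits(b)).count("1")


expected = -1234.1234          # value in expected/hash_index.out
observed = -1234.12356777216   # value in results/hash_index.out

print(f"expected bits:  {double_bits(expected):016x}")
print(f"observed bits:  {double_bits(observed):016x}")
print(f"differing bits: {differing_bits(expected, observed)}")
print(f"absolute diff:  {abs(expected - observed):.11e}")
```

Comparing the two hex patterns shows which part of the mantissa changed, which is a quicker way to judge whether a mismatch is a rounding difference or something stranger than reading the decimal output.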

Has there been a system software update on this machine a month or so ago?
If not, it's hard to think anything except that the floating point
hardware on this box has developed problems.

regards, tom lane

