Re: [HACKERS] 9.3: more problems with Could not open file pg_multixact/members/xxxx

2014-08-19 Thread Jeff Janes
On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera alvhe...@2ndquadrant.com
  wrote:

 Jeff Janes wrote:

  This problem was initially fairly easy to reproduce, but since I
  started adding instrumentation specifically to catch it, it has become
  devilishly hard to reproduce.
 
  I think my next step will be to also log each of the values which goes
  into the complex if (...) expression that decides on the deletion.

 Could you please try to reproduce it after updating to latest?  I pushed
 fixes that should close these issues.  Maybe you want to remove the
 instrumentation you added, to make failures more likely.


 There are still some problems in 9.4, but I haven't been able to diagnose
 them and wanted to do more research on it.  The announcement of upcoming
 back-branches for 9.3 spurred me to try it there, and I have problems with
 9.3 (12c5bbdcbaa292b2a4b09d298786) as well.  The move of truncation to the
 checkpoint seems to have made the problem easier to reproduce.  On an 8
 core machine, this test fell over after about 20 minutes, which is much
 faster than it usually reproduces.

 This is the error I get:

 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR:  could not access status of
 transaction 85837221
 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL:  Could not open file
 pg_multixact/members/14031: No such file or directory.
 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT:  SQL statement SELECT 1
 FROM ONLY public.foo_parent x WHERE id OPERATOR(pg_catalog.=) $1 FOR
 KEY SHARE OF x

 The testing harness is attached as 3 patches that must be made to the test
 server, and 2 scripts. The script do.sh sets up the database (using fixed
 paths, so be careful) and then invokes count.pl in a loop to do the
 actual work.


Sorry, after a long time when I couldn't do much testing on this, I've now
been able to get back to it.

It looks like what is happening is that  checkPoint.nextMultiOffset wraps
around from 2^32 to 0, even if 0 is still being used.  At that point it
starts deleting member files that are still needed.
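
To make the wraparound concrete, here is a minimal sketch of how a 32-bit
member offset could map to a segment file name. The constants are assumptions
that are not stated in this thread (default 8 kB block size, members stored in
20-byte groups of four, 32 pages per segment), and segment_name() and advance()
are hypothetical helpers, not PostgreSQL functions. Under those assumptions the
highest possible segment is 14078, so the missing file 14031 in the error above
sits close to the top of the offset range, which is consistent with a counter
that is about to wrap.

```python
# Assumed constants (not taken from this thread): default 8 kB block size,
# members stored in groups of 4 flag bytes plus 4 four-byte xids,
# and 32 pages per segment file.
BLCKSZ = 8192
MEMBERS_PER_GROUP = 4
GROUP_SIZE = 4 + 4 * MEMBERS_PER_GROUP                         # 20 bytes
MEMBERS_PER_PAGE = (BLCKSZ // GROUP_SIZE) * MEMBERS_PER_GROUP  # 1636
PAGES_PER_SEGMENT = 32
UINT32_MASK = 0xFFFFFFFF


def segment_name(offset):
    """Hypothetical helper: segment file name holding a 32-bit member offset."""
    segno = (offset & UINT32_MASK) // MEMBERS_PER_PAGE // PAGES_PER_SEGMENT
    return "%04X" % segno


def advance(offset, nmembers):
    """Advance the next-offset counter with unsigned 32-bit wraparound.

    Simplified: any special handling of particular offset values is ignored.
    """
    return (offset + nmembers) & UINT32_MASK


if __name__ == "__main__":
    nxt = 2**32 - 2
    for _ in range(4):
        print(nxt, "-> pg_multixact/members/" + segment_name(nxt))
        nxt = advance(nxt, 1)
```

Once the counter wraps, new members land in segment 0000 again, whether or not
the old contents of that segment are still referenced.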

Is there some interlock which is supposed to prevent
checkPoint.nextMultiOffset from lapping itself?  I haven't been able to find
it.  It seems like the interlock applies only to MultiXid, not the Offset.

Thanks,

Jeff


Re: [HACKERS] 9.3: more problems with Could not open file pg_multixact/members/xxxx

2014-08-19 Thread Andres Freund
On August 19, 2014 10:24:20 PM CEST, Jeff Janes jeff.ja...@gmail.com wrote:

It looks like what is happening is that checkPoint.nextMultiOffset wraps
around from 2^32 to 0, even if 0 is still being used.  At that point it
starts deleting member files that are still needed.

Is there some interlock which is supposed to prevent
checkPoint.nextMultiOffset from lapping itself?  I haven't been able to find
it.  It seems like the interlock applies only to MultiXid, not the Offset.

There is none (and there never has been one either). I've complained about it a 
couple of times but nobody, me included, had time and energy to fix that :(
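
For illustration only, a check of the kind being asked about could be sketched
as below. This is a hypothetical example, not code from PostgreSQL and not a
proposed patch: would_lap() and its parameters are made-up names, and it
assumes the oldest member offset still in use is known when new members are
allocated. It works on individual offsets, whereas real truncation operates on
whole segment files.

```python
UINT32_SPAN = 2**32  # size of the 32-bit offset space


def would_lap(oldest_offset, next_offset, nmembers):
    """Return True if allocating nmembers at next_offset would reach or
    pass oldest_offset, i.e. the counter would lap itself.

    Hypothetical sketch: both offsets are unsigned 32-bit values, and
    next_offset == oldest_offset is treated as "nothing in use".
    """
    in_use = (next_offset - oldest_offset) % UINT32_SPAN
    return in_use + nmembers >= UINT32_SPAN


if __name__ == "__main__":
    # Counter has wrapped and sits 50 offsets behind the oldest live member.
    print(would_lap(100, 50, 49))   # still fits
    print(would_lap(100, 50, 50))   # would reach the oldest offset
```

The modular subtraction keeps the comparison valid after next_offset has
wrapped past zero. Refusing the allocation that would land exactly on
oldest_offset is deliberately conservative, since a full and an empty offset
space would otherwise look the same.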

Andres



