Re: [HACKERS] VACUUM FULL versus relcache init files

2011-08-16 Thread Robert Haas
On Mon, Aug 15, 2011 at 9:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This might be the last bug from my concurrent-vacuum-full testing --- at
 least, I have no remaining unexplained events from about two full days
 of running the tests.  The ones that are left involve backends randomly
 failing like this:

 psql: FATAL:  could not read block 0 in file base/130532080/130545668: read only 0 of 8192 bytes

 usually during startup, though I have one example of a backend being
 repeatedly unable to access pg_proc due to similar errors.

 I believe what this traces to is stale relfilenode information taken
 from relcache init files, which contain precomputed relcache data
 intended to speed up backend startup.  There is a curious little dance
 between write_relcache_init_file and RelationCacheInitFileInvalidate
 that is intended to ensure that when a process creates a new relcache
 init file that is already stale (because someone else invalidated the
 information concurrently), the bad file will get unlinked and not used.
 I think I invented that logic, so it's my fault that it doesn't work.

 It works fine as long as you consider only the two processes directly
 involved; but a third process can get fooled into using stale data.
 The scenario requires two successive invalidations, as for example from
 vacuum full on two system catalogs in a row, plus a stream of incoming
 new backends.  It goes like this:

 1. Process A vacuums a system catalog, unlinks init file, sends sinval
 messages, does second unlink (which does nothing).

 2. Process B starts, observes lack of init file, begins to construct
 a new one.  It gets to the end of AcceptInvalidationMessages in
 write_relcache_init_file.  Since it started after A sent the first
 sinvals, it sees no incoming sinval messages and has no reason to think
 its new init file isn't good.

 3. Meanwhile, Process A vacuums another system catalog, unlinks init
 file (doing nothing), and finally sends its sinval messages just after
 B looked for them.  Now it will block trying to get RelCacheInitLock,
 which B holds.

 4. Process B renames its already-stale init file into place, then
 releases RelCacheInitLock.

 5. Process A gets the lock and removes the stale init file.

 Now process B is okay, because it will see A's second sinvals before it
 tries to make any use of the relcache data it has.  And the stale init
 file is definitely gone after step 5.

 However, between steps 4 and 5 there is a window for Process C to start,
 read the stale init file, and attempt to use it.  Since C started after
 A's second set of sinval messages, it doesn't see them and doesn't know
 it has stale data.

 As far as I can see at the moment, the only way to make this bulletproof
 is to turn both creation and deletion of the init file into atomic
 operations that include sinval messaging.  What I have in mind is

 Creator: must take RelCacheInitLock, check for incoming invals, rename
 the new file into place if none, release RelCacheInitLock.  (This is
 the same as what it does now.)

 Destroyer: must take RelCacheInitLock, unlink the init file, send its
 sinvals, release RelCacheInitLock.

 This guarantees that we serialize the sending of the sinval messages
 so that anyone who sees a bad init file in place *must* see the sinval
 messages afterwards, so long as they join the sinval messaging ring
 before looking for the init file (which they do).  I don't think it's
 any worse than the current scheme from a parallelism point of view: the
 destroyer is holding RelCacheInitLock a bit longer than before, but that
 should not be a performance critical situation.

 Anybody see a hole in that?

I don't.  Seems more robust than the old way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] VACUUM FULL versus relcache init files

2011-08-15 Thread Tom Lane
This might be the last bug from my concurrent-vacuum-full testing --- at
least, I have no remaining unexplained events from about two full days
of running the tests.  The ones that are left involve backends randomly
failing like this:

psql: FATAL:  could not read block 0 in file base/130532080/130545668: read only 0 of 8192 bytes

usually during startup, though I have one example of a backend being
repeatedly unable to access pg_proc due to similar errors.

I believe what this traces to is stale relfilenode information taken
from relcache init files, which contain precomputed relcache data
intended to speed up backend startup.  There is a curious little dance
between write_relcache_init_file and RelationCacheInitFileInvalidate
that is intended to ensure that when a process creates a new relcache
init file that is already stale (because someone else invalidated the
information concurrently), the bad file will get unlinked and not used.
I think I invented that logic, so it's my fault that it doesn't work.

It works fine as long as you consider only the two processes directly
involved; but a third process can get fooled into using stale data.
The scenario requires two successive invalidations, as for example from
vacuum full on two system catalogs in a row, plus a stream of incoming
new backends.  It goes like this:

1. Process A vacuums a system catalog, unlinks init file, sends sinval
messages, does second unlink (which does nothing).

2. Process B starts, observes lack of init file, begins to construct
a new one.  It gets to the end of AcceptInvalidationMessages in
write_relcache_init_file.  Since it started after A sent the first
sinvals, it sees no incoming sinval messages and has no reason to think
its new init file isn't good.

3. Meanwhile, Process A vacuums another system catalog, unlinks init
file (doing nothing), and finally sends its sinval messages just after
B looked for them.  Now it will block trying to get RelCacheInitLock,
which B holds.

4. Process B renames its already-stale init file into place, then
releases RelCacheInitLock.

5. Process A gets the lock and removes the stale init file.

Now process B is okay, because it will see A's second sinvals before it
tries to make any use of the relcache data it has.  And the stale init
file is definitely gone after step 5.

However, between steps 4 and 5 there is a window for Process C to start,
read the stale init file, and attempt to use it.  Since C started after
A's second set of sinval messages, it doesn't see them and doesn't know
it has stale data.

As far as I can see at the moment, the only way to make this bulletproof
is to turn both creation and deletion of the init file into atomic
operations that include sinval messaging.  What I have in mind is

Creator: must take RelCacheInitLock, check for incoming invals, rename
the new file into place if none, release RelCacheInitLock.  (This is
the same as what it does now.)

Destroyer: must take RelCacheInitLock, unlink the init file, send its
sinvals, release RelCacheInitLock.

This guarantees that we serialize the sending of the sinval messages
so that anyone who sees a bad init file in place *must* see the sinval
messages afterwards, so long as they join the sinval messaging ring
before looking for the init file (which they do).  I don't think it's
any worse than the current scheme from a parallelism point of view: the
destroyer is holding RelCacheInitLock a bit longer than before, but that
should not be a performance critical situation.

Anybody see a hole in that?

regards, tom lane
