Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-11-04 Thread Bruce Momjian
Tom Lane wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Alvaro Herrera's message of Fri Oct 28 16:47:13 -0300 2011:
  BTW we had previous discussions about dropping pg_database's toast
  table.  Maybe this is a good time to do it, even if there's no risk of
  this bug (or the hypothetical circularity detoasting problem) showing up
  there.
 
 No objection from me.
 
  Oh, something unrelated I just remembered: we have
  pg_database.datlastsysoid which seems unused.  Perhaps we should remove
  that column for cleanliness.
 
 I have a vague recollection that some client-side code uses this to
 figure out what's the dividing line between user and system OIDs.
 pg_dump used to need to know that number, and it can still be useful
 with casts and other things where it's hard to tell whether they're built-in.
 While in principle people could use FirstNormalObjectId instead, that
 could come back to bite us if we ever have to increase that constant
 in future releases.  I'm inclined to leave this one alone.

Maybe add a C comment?
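
Something along these lines, say (wording is only a sketch, next to the
column's entry in pg_database.h):

    /*
     * datlastsysoid is no longer used by the backend itself, but some
     * client-side code (e.g. pgAdmin) still queries it to find the
     * dividing line between built-in and user-created OIDs, and
     * FirstNormalObjectId (currently 16384) isn't guaranteed to stay
     * fixed across releases.  So don't remove this column.
     */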

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Wait a minute: I'm confused.  What's at issue here, at least AIUI, is
 taking a TOAST pointer and fetching the corresponding value.  But that
 must involve reading from the TOAST table, and our usual paradigm for
 reading from system catalogs is (1) take AccessShareLock, (2) read the
 relevant rows, (3) release AccessShareLock.  If we're doing that here,
 then AcceptInvalidationMessages() is already getting called.  If we're
 not, this seems horribly unsafe; aside from the current bug, somebody
 could rewrite the table in the interim.

Yes, the TOAST fetch will call AcceptInvalidationMessages().  But
(1) we are starting from a TOAST pointer that is within a now-stale
syscache entry, and even if we get an sinval message telling us it's
stale when we open the toast table, we're still going to try to fetch
that toast OID.  Worse, (2) there is no guarantee that the inval message
has even been sent yet, because there is no lock keeping us from trying
to use the syscache entry before the inval has been sent.

After further reflection I believe that pg_statistic entries invalidated
by ANALYZE are not the only problem.  Consider a pg_proc row whose
prosrc field is toasted (not too unlikely), and suppose that somebody
executes CREATE OR REPLACE FUNCTION to update it to a different toasted
value.  Meanwhile, somebody else is about to use the function, and they
have a copy of its pg_proc row in syscache.  There is no lock that will
block that second backend from fetching the toast tuples before it's
seen the sinval message from the CREATE OR REPLACE FUNCTION.  So the
exact same type of race can occur, if the first backend gets delayed
between committing and broadcasting its sinval messages, and an
overeager vacuum comes along to clean up the dead toast tuples.
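
To lay the interleaving out explicitly (a schematic only, not code):

    /*
     * Backend A                        Backend B
     * ---------                        ---------
     *                                  earlier xact: calls f(), caches
     *                                  its pg_proc row (toast pointer)
     * CREATE OR REPLACE FUNCTION f...
     * commit: clog + ProcArray done,
     * but sinval send is delayed
     *                                  new xact starts after A's commit
     * [VACUUM computes OldestXmin
     *  past A's XID, removes the
     *  dead prosrc toast tuples]
     *                                  calls f(): cached row looks valid
     *                                  (no sinval seen yet); detoasting
     *                                  => "missing chunk number 0"
     * sinval messages go out too late
     */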

I am currently thinking that we will have to go with Heikki's solution
of detoasting any out-of-line fields before we put a tuple into
syscache.  That does fix the problem, so long as we hold a transaction
snapshot at the instant we fetch any catalog tuple that needs this
treatment, because at the time we fetch the tuple we make a SnapshotNow
test that shows the tuple is good (and its dependent toast tuples are
too).  Even if somebody has outdated the tuple and commits immediately
afterward, vacuum cannot remove the toast tuples before we fetch them,
because the outdater is beyond our advertised MyProc->xmin.  The risk
factor comes in when we hold a syscache entry across transactions; then
this guarantee is lost.  (In both the original example and the pg_proc
case above, the second backend is only at risk when it starts its
transaction after the first one's commit, and in between VACUUM was able
to compute an OldestXmin greater than the first one's XID.)
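
In code terms the fix could be as small as this sketch (re-using
toast_flatten_tuple() from tuptoaster.c; the committed patch may well
differ), applied where catcache.c copies a fetched tuple into a cache
entry:

    /*
     * Sketch: expand out-of-line fields before the tuple goes into the
     * syscache.  This must happen while we still hold a transaction
     * snapshot that keeps VACUUM away from the toast tuples.
     */
    if (HeapTupleHasExternal(ntp))
        dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
    else
        dtp = heap_copytuple(ntp);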

I guess an alternative approach would be to discard syscache entries
at transaction end if HeapTupleHasExternal is true, or equivalently
mark them as to which transaction read them and not use them in later
transactions.  But that seems more fragile, because it would have to
be tied into SnapshotResetXmin --- any time we discard our last snapshot
intra-transaction, toasted syscache entries would have to be invalidated
since our advertised xmin is no longer protecting them.

The main concern I had about detoast before caching is the risk of
circularity, ie, needing detoastable cache entries in order to figure
out how to detoast.  But I think it's probably okay.  The current list
of catalogs with toast tables is

 pg_attrdef
 pg_constraint
 pg_database
 pg_db_role_setting
 pg_description
 pg_proc
 pg_rewrite
 pg_seclabel
 pg_shdescription
 pg_statistic
 pg_trigger

Of these, only pg_proc is even conceivably consulted during a toast
table fetch, and we can be sure that functions needed for such a fetch
don't have toasted entries.  But we will have to be very wary of any
future proposal for adding a toast table to pg_class, pg_index, etc.

Detoast-before-caching also seems sufficiently safe and localized to
be a back-patchable fix, whereas I'd be very scared of making any
significant overhaul of the sinval mechanism in the back branches.

Barring objections, I'll see about making this happen.

regards, tom lane



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Alvaro Herrera

Excerpts from Tom Lane's message of Fri Oct 28 15:37:43 -0300 2011:

 The main concern I had about detoast before caching is the risk of
 circularity, ie, needing detoastable cache entries in order to figure
 out how to detoast.  But I think it's probably okay.  The current list
 of catalogs with toast tables is
 
  pg_attrdef
  pg_constraint
  pg_database
  pg_db_role_setting
  pg_description
  pg_proc
  pg_rewrite
  pg_seclabel
  pg_shdescription
  pg_statistic
  pg_trigger
 
 Of these, only pg_proc is even conceivably consulted during a toast
 table fetch, and we can be sure that functions needed for such a fetch
 don't have toasted entries.  But we will have to be very wary of any
 future proposal for adding a toast table to pg_class, pg_index, etc.

BTW we had previous discussions about dropping pg_database's toast
table.  Maybe this is a good time to do it, even if there's no risk of
this bug (or the hypothetical circularity detoasting problem) showing up
there.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Alvaro Herrera

Excerpts from Alvaro Herrera's message of Fri Oct 28 16:47:13 -0300 2011:

 BTW we had previous discussions about dropping pg_database's toast
 table.  Maybe this is a good time to do it, even if there's no risk of
 this bug (or the hypothetical circularity detoasting problem) showing up
 there.

Oh, something unrelated I just remembered: we have
pg_database.datlastsysoid which seems unused.  Perhaps we should remove
that column for cleanliness.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 The risk factor comes in when we hold a syscache entry across
 transactions; then this guarantee is lost.  (In both the original
 example and the pg_proc case above, the second backend is only at
 risk when it starts its transaction after the first one's commit,
 and in between VACUUM was able to compute an OldestXmin greater
 than the first one's XID.)
 
If we made the commit sequence number more generally available,
incrementing it at the point of visibility change under cover of
ProcArrayLock, and including the then-current value in a Snapshot
object when built, would that help with this at all?  Since it's
something we might need to do to optimize SSI for 9.2, and the
issues sound vaguely echoic of the reasons this is maintained for
SSI, it seems worth asking the question, anyway.
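
Roughly this shape, with the names invented for illustration:

    /* shared counter, advanced at the instant a transaction's effects
     * become visible, under cover of ProcArrayLock */
    static volatile uint64 commitSeqNo;     /* lives in shared memory */

    /* in ProcArrayEndTransaction(), ProcArrayLock held exclusive: */
    commitSeqNo++;

    /* in GetSnapshotData(), ProcArrayLock held shared: */
    snapshot->takenAtCommitSeqNo = commitSeqNo;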
 
-Kevin



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 If we made the commit sequence number more generally available,
 incrementing it at the point of visibility change under cover of
 ProcArrayLock, and including the then-current value in a Snapshot
 object when built, would that help with this at all?

No, because we need a back-patchable fix.  Even going forward,
I don't find the idea of flushing syscache entries at transaction
end to be especially appealing.

regards, tom lane



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Alvaro Herrera's message of Fri Oct 28 16:47:13 -0300 2011:
 BTW we had previous discussions about dropping pg_database's toast
 table.  Maybe this is a good time to do it, even if there's no risk of
 this bug (or the hypothetical circularity detoasting problem) showing up
 there.

No objection from me.

 Oh, something unrelated I just remembered: we have
 pg_database.datlastsysoid which seems unused.  Perhaps we should remove
 that column for cleanliness.

I have a vague recollection that some client-side code uses this to
figure out what's the dividing line between user and system OIDs.
pg_dump used to need to know that number, and it can still be useful
with casts and other things where it's hard to tell whether they're built-in.
While in principle people could use FirstNormalObjectId instead, that
could come back to bite us if we ever have to increase that constant
in future releases.  I'm inclined to leave this one alone.

regards, tom lane



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-28 Thread Merlin Moncure
On Fri, Oct 28, 2011 at 3:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Alvaro Herrera's message of Fri Oct 28 16:47:13 -0300 2011:
 BTW we had previous discussions about dropping pg_database's toast
 table.  Maybe this is a good time to do it, even if there's no risk of
 this bug (or the hypothetical circularity detoasting problem) showing up
 there.

 No objection from me.

 Oh, something unrelated I just remembered: we have
 pg_database.datlastsysoid which seems unused.  Perhaps we should remove
 that column for cleanliness.

 I have a vague recollection that some client-side code uses this to
 figure out what's the dividing line between user and system OIDs.
 pg_dump used to need to know that number, and it can still be useful
 with casts and other things where it's hard to tell whether they're built-in.
 While in principle people could use FirstNormalObjectId instead, that
 could come back to bite us if we ever have to increase that constant
 in future releases.  I'm inclined to leave this one alone.

A cursory search of the archives turned up that pgAdmin explicitly
queries it (or at least did in 2009).  There are also some fairly
ancient references to discussions about using it in pg_dump to deal
with the 'not dumping user casts wrapping system objects' issue, but
that was deemed unreliable and is probably not relevant.

merlin



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-26 Thread Simon Riggs
On Wed, Oct 26, 2011 at 4:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Even given your recent changes to reduce the overhead of checking for
 sinval messages, I'm not sure that it'd be practical to move the sinval
 message processing to just-before-we-look-up-a-cache-entry.  Right now,
 we do AcceptInvalidationMessages basically once per table per query
 (or maybe it's twice or so, but anyway a very small multiplier on that).
 If we try to do it every time through SearchSysCache, we are probably
 talking two to three orders of magnitude more checks, which ISTM is
 certain to push the sinval queue back up to the top of the heap for
 contention.

Initially I provided an implementation of eager locking, as Robert
suggests, but the above is a great argument against doing it that way.

Incidentally, I'd like to focus on the causal reason for these
problems. We have static type definitions, which allow us to
completely avoid dynamic type management at runtime. That is a
performance advantage for us in normal running, but it can also cause
operational problems when we look to make changes.

 But in any case, this isn't the core of the problem.  The real point
 here is that we need a guarantee that a syscache entry we're going to
 use is/was valid as of some suitable time point later than the start of
 our current transaction.  (Once we have taken a snapshot, VACUUM will
 know that it can't remove any tuples that were deleted after the time of
 that snapshot; so even for SnapshotNow fetches, it's important to have
 an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
 than tying the problem into SearchSysCache, we should attach the
 overhead to GetTransactionSnapshot, which is called appealingly few
 times per query.  But right offhand it seems like that only protects us
 against the toast-tuple-deletion problem, not against the more general
 one of getting a stale view of the status of some relation.

Do we need to guarantee that? It seems strange to me to use the words
"cache" and "strict" in the same sentence.

I'm sure there are many points in the code where strictness is
required; but in many cases it should also be acceptable to say that a
cache miss is not a problem, and so no strict guarantees are needed.

Do we need a redesign? Or do we need a way to use
relaxation/optimistic techniques?

I'm aware that it could be a huge undertaking to examine all call
points. If you have any ideas of where to investigate or parts of the
problem to assist with, I'll be happy to work on this. This seems
important to me.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-26 Thread Robert Haas
On Tue, Oct 25, 2011 at 11:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Even given your recent changes to reduce the overhead of checking for
 sinval messages, I'm not sure that it'd be practical to move the sinval
 message processing to just-before-we-look-up-a-cache-entry.

Wait a minute: I'm confused.  What's at issue here, at least AIUI, is
taking a TOAST pointer and fetching the corresponding value.  But that
must involve reading from the TOAST table, and our usual paradigm for
reading from system catalogs is (1) take AccessShareLock, (2) read the
relevant rows, (3) release AccessShareLock.  If we're doing that here,
then AcceptInvalidationMessages() is already getting called.  If we're
not, this seems horribly unsafe; aside from the current bug, somebody
could rewrite the table in the interim.

 Right now,
 we do AcceptInvalidationMessages basically once per table per query
 (or maybe it's twice or so, but anyway a very small multiplier on that).
 If we try to do it every time through SearchSysCache, we are probably
 talking two to three orders of magnitude more checks, which ISTM is
 certain to push the sinval queue back up to the top of the heap for
 contention.

I think we've pretty much got the *contention* licked, barring an
increase in the number of things that generate sinval messages, but
calling it too often will still be slow, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Tom Lane
I believe I have reproduced the behavior described by Andrew Hammond in
http://archives.postgresql.org/pgsql-general/2011-10/msg00928.php

This is using the regression database:

1. In session 1, do
set default_statistics_target TO 10000;
analyze tenk1;
(We need the large stats target to ensure that tenk1's pg_statistic
entries require toasting.)

2. Attach to session 1 with a debugger and set a breakpoint at
CommitTransaction's call to CallXactCallbacks (or anyplace after
ProcArrayEndTransaction and before AtEOXact_Inval).

3. In session 2, do

select count(*) from tenk1 where fivethous < 2500;

(This loads up session 2's syscaches with toasted pg_statistic entries.)

4. In session 1, again do

analyze tenk1;

and wait for it to stop at the breakpoint.

5. In session 3 (or you can use session 2 for this), do
 vacuum verbose pg_statistic;
You should see it removing toast entries that were generated in step 1
and obsoleted in step 4.

6. In session 2, again do

select count(*) from tenk1 where fivethous < 2500;

and voila:

ERROR:  missing chunk number 0 for toast value 53668 in pg_toast_2619

What has happened here is that the second ANALYZE has marked itself
committed in pg_clog and no longer running in the ProcArray, so VACUUM
feels entitled to remove toast tuples that the ANALYZE deleted.  However,
the ANALYZE has not yet sent out the sinval messages that would inform
session 2 that its syscache entries are obsolete.  In Andrew's report,
presumably the machine was under enough load to slow down ANALYZE at
just this point, and there was a concurrent autovacuum that would have
done the rest of the deed.  The problem could only be seen for a short
interval, which squares with his report, and with a similar one from
Tim Uckun back in September.

Ordinarily, sending out sinval messages post-commit is okay because we
don't release locks until after that, and we suppose that our locks
prevent any other transactions from getting to the point of using
syscache entries that might have been invalidated by our transaction.
However, *we have carefully hacked on ANALYZE until it doesn't take any
locks that would block concurrent queries on the analyzed table.*  So
the normal protection against stale syscache entries simply doesn't
work for pg_statistic fetches.
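
For reference, the commit-time ordering that opens the window, much
simplified from CommitTransaction() in xact.c:

    RecordTransactionCommit();      /* marked committed in pg_clog */
    ProcArrayEndTransaction(...);   /* no longer running: VACUUM can now
                                     * compute OldestXmin past our XID */
    /* <-- window: our dead toast tuples are removable, yet nobody has
     *     been told their syscache entries are stale ... */
    AtEOXact_Inval(true);           /* ... sinval only goes out here */
    ResourceOwnerRelease(...);      /* and locks are released later still */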

I'm not sure about a good way to fix this.  When we last dealt with a
similar failure, Heikki suggested that we forcibly detoast all fields in
a tuple that we're putting into the syscaches:
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00661.php
I don't much like that, though, as it seems expensive, and I'm worried
about possible circularity of needing to know about all toastable fields
while making a syscache entry, and anyway it's papering over a symptom
rather than solving the actual problem that we're relying on a stale
syscache entry.

We could fix it by not using a syscache anymore for pg_statistic
entries, but that's probably not acceptable from a performance
standpoint.

A clean fix would be to add locking that blocks would-be users of
pg_statistic entries when an ANALYZE is about to commit.  This isn't
much fun from a performance standpoint either, but at least it should be
relatively cheap most of the time.
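
As a sketch of the shape only (lock levels merely illustrative):
ANALYZE would take a reader-blocking lock on pg_statistic just before
commit and hold it until its sinval messages are out, while readers
bracket their fetches:

    /* reader side: */
    LockRelationOid(StatisticRelationId, AccessShareLock);
    /* ... fetch (and detoast) STATRELATTINH syscache entries ... */
    UnlockRelationOid(StatisticRelationId, AccessShareLock);

    /* ANALYZE side, just before commit: */
    LockRelationOid(StatisticRelationId, AccessExclusiveLock);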

Thoughts?

regards, tom lane



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Robert Haas
On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 What has happened here is that the second ANALYZE has marked itself
 committed in pg_clog and no longer running in the ProcArray, so VACUUM
 feels entitled to remove toast tuples that the ANALYZE deleted.  However,
 the ANALYZE has not yet sent out the sinval messages that would inform
 session 2 that its syscache entries are obsolete.  In Andrew's report,
 presumably the machine was under enough load to slow down ANALYZE at
 just this point, and there was a concurrent autovacuum that would have
 done the rest of the deed.  The problem could only be seen for a short
 interval, which squares with his report, and with a similar one from
 Tim Uckun back in September.

 Ordinarily, sending out sinval messages post-commit is okay because we
 don't release locks until after that, and we suppose that our locks
 prevent any other transactions from getting to the point of using
 syscache entries that might have been invalidated by our transaction.
 However, *we have carefully hacked on ANALYZE until it doesn't take any
 locks that would block concurrent queries on the analyzed table.*  So
 the normal protection against stale syscache entries simply doesn't
 work for pg_statistic fetches.

This is very similar to one of the issues that reared its ugly head in
regards to Simon's now-reverted patch to lower DDL locking strength.
You identified some other issues there as well, but *one* of the
issues was that, as in this case, the sinval mechanism fails to
provide the necessary synchronization guarantees unless the lock
required to reread the updated data conflicts with the lock required
to change the data.  In that case, the "data" meant the pg_class
entry or the pg_attribute entry, whereas here it means the
pg_statistic entry, but I believe the principle is the same.  And
there as here, (1) there is a fundamental conflict between what the
sinval mechanism requires for correctness and what is actually
desirable in terms of lock levels from a user experience point of view
and (2) it is relatively easy to write code that looks superficially
safe but which actually contains subtle race conditions.  IIRC, you
never thought Simon's patch looked safe, but I'm guessing that this
pg_statistic bug has been around for a long time.

So I'm wondering if we ought to rethink our position that users of the
sinval machinery must provide their own external synchronization
through heavyweight locking, and instead build the synchronization
into the sinval mechanism itself.  One idea I had was to include the
sending transaction's XID in every sinval message,
and to force clients receiving a message to do XactLockTableWait() for
each such XID.  That would force the backend reloading its cache to
wait until the committing transaction reaches the lock-release phase.
If we sent out sinval messages just before removing ourselves from the
ProcArray, I think that would more-or-less fix this bug (although
maybe I'm missing some reason why it's not practical to send them that
early) except that I don't see any way to handle the sinval-reset
case, which seems to more or less kill this idea in its tracks.
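
The receiving side would look something like this sketch, where the
xid field on the message is exactly the part that doesn't exist today:

    /* hypothetical: each sinval message carries its sender's XID */
    if (TransactionIdIsValid(msg->xid))
        XactLockTableWait(msg->xid);    /* returns once the sender has
                                         * reached its lock-release phase */
    LocalExecuteInvalidationMessage(msg);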

But maybe there's some other mechanism whereby we could combine
sending the sinval messages slightly earlier (before
ProcArrayEndTransaction) with blocking anyone who processes those
messages until after the committing backend finishes
ProcArrayEndTransaction.  For example, you could add an additional
LWLock, which has to be held in exclusive mode by a committing
transaction that sends any sinval messages.  It must be acquired
before sending the sinval messages and can't be released until after
ProcArrayEndTransaction() is complete.  Anyone processing a sinval
message must acquire and release the lock in shared mode before
reloading their caches, so that we guarantee that at the time you
reread the catalogs, any transactions involved in sending those
messages are visible.
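
In outline, with SInvalCommitLock as a made-up name for the new lock:

    /* committing backend: */
    LWLockAcquire(SInvalCommitLock, LW_EXCLUSIVE);
    SendSharedInvalidMessages(invalMessages, nmsgs);
    ProcArrayEndTransaction(MyProc, latestXid);
    LWLockRelease(SInvalCommitLock);

    /* backend processing sinval messages, before reloading caches: */
    LWLockAcquire(SInvalCommitLock, LW_SHARED);
    LWLockRelease(SInvalCommitLock);    /* any sender is now out of the
                                         * ProcArray, hence visible */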

That's actually a bit coarse-grained; there's probably a better
mechanism, but I'm just throwing this out to see if the basic idea has
any legs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] TOAST versus VACUUM, or missing chunk number 0 for toast value identified

2011-10-25 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Ordinarily, sending out sinval messages post-commit is okay because we
 don't release locks until after that, and we suppose that our locks
 prevent any other transactions from getting to the point of using
 syscache entries that might have been invalidated by our transaction.
 However, *we have carefully hacked on ANALYZE until it doesn't take any
 locks that would block concurrent queries on the analyzed table.*  So
 the normal protection against stale syscache entries simply doesn't
 work for pg_statistic fetches.

 This is very similar to one of the issues that reared its ugly head in
 regards to Simon's now-reverted patch to lower DDL locking strength.
 You identified some other issues there as well, but *one* of the
 issues was that, as in this case, the sinval mechanism fails to
 provide the necessary synchronization guarantees unless the lock
 required to reread the updated data conflicts with the lock required
 to change the data.

Right.  We may take as little as AccessShareLock on a relation before
examining its pg_statistic entries, and ANALYZE isn't taking anything
that would block that.

 So I'm wondering if we ought to rethink our position that users of the
 sinval machinery must provide their own external synchronization
 through heavyweight locking, and instead build the synchronization
 into the sinval mechanism itself.

Yeah, it's starting to feel like we need a basic redesign of sinval
... although I'd not care to back-patch that, so we also need to think
of a sane solution for the back branches.

 If we sent out sinval messages just before removing ourselves from the
 ProcArray, I think that would more-or-less fix this bug (although
 maybe I'm missing some reason why it's not practical to send them that
 early) except that I don't see any way to handle the sinval-reset
 case, which seems to more or less kill this idea in its tracks.

The other reason that doesn't work is there's a race condition: someone
might load their cache entry immediately after the sinval message went
past, but before the updating transaction commits.

 But maybe there's some other mechanism whereby we could combine
 sending the sinval messages slightly earlier (before
 ProcArrayEndTransaction) with blocking anyone who processes those
 messages until after the committing backend finishes
 ProcArrayEndTransaction.  For example, you could add an additional
 LWLock, which has to be held in exclusive mode by a committing
 transaction that sends any sinval messages.

Doesn't sound very scalable :-(.

Even given your recent changes to reduce the overhead of checking for
sinval messages, I'm not sure that it'd be practical to move the sinval
message processing to just-before-we-look-up-a-cache-entry.  Right now,
we do AcceptInvalidationMessages basically once per table per query
(or maybe it's twice or so, but anyway a very small multiplier on that).
If we try to do it every time through SearchSysCache, we are probably
talking two to three orders of magnitude more checks, which ISTM is
certain to push the sinval queue back up to the top of the heap for
contention.

But in any case, this isn't the core of the problem.  The real point
here is that we need a guarantee that a syscache entry we're going to
use is/was valid as of some suitable time point later than the start of
our current transaction.  (Once we have taken a snapshot, VACUUM will
know that it can't remove any tuples that were deleted after the time of
that snapshot; so even for SnapshotNow fetches, it's important to have
an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
than tying the problem into SearchSysCache, we should attach the
overhead to GetTransactionSnapshot, which is called appealingly few
times per query.  But right offhand it seems like that only protects us
against the toast-tuple-deletion problem, not against the more general
one of getting a stale view of the status of some relation.

regards, tom lane
