Re: [HACKERS] SPGiST versus hot standby - question about conflict resolution rules

2012-03-13 Thread Simon Riggs
On Tue, Mar 13, 2012 at 2:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:

> Info appreciated.

Email seen, will reply when I can later today.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SPGiST versus hot standby - question about conflict resolution rules

2012-03-12 Thread Tom Lane
There is one more (known) stop-ship problem in SPGiST, which I'd kind of
like to get out of the way now before I let my knowledge of that code
get swapped out again.  This is that SPGiST is unsafe for use by hot
standby slaves.

The problem comes from redirect tuples, which are short-lifespan
objects that replace a tuple that's been moved to another page.
A redirect tuple can be recycled as soon as no active indexscan could
be in flight from the parent index page to the moved tuple.  SPGiST
implements this by marking each redirect tuple with the XID of the
creating transaction, and assuming that the tuple can be recycled once
that XID is below the OldestXmin horizon (implying that all active
transactions started after it ended).  This is fine as far as
transactions on the master are concerned, but there is no guarantee that
the recycling WAL record couldn't be replayed on a hot standby slave
while there are still HS transactions that saw the old state of the
parent index tuple.

Now, btree has a very similar problem with deciding when it's safe to
recycle a deleted index page: it has to wait out transactions that could
be in flight to the page, and it does that by marking deleted pages with
XIDs.  I see that the problem has been patched for btree by emitting a
special WAL record just before a page is recycled.  However, I'm a bit
nervous about copying that solution, because the details are a bit
different.  In particular, I see that btree marks deleted pages with
ReadNewTransactionId() --- that is, the next-to-be-assigned XID ---
rather than the XID of the originating transaction, and then it
subtracts one from the XID before sending it to the WAL stream.
The comments about this are not clear enough for me, and so I'm
wondering whether it's okay to use the originating transaction XID
in a similar way, or if we need to modify SPGiST's rule for how to
mark redirection tuples.  I think that the use of ReadNewTransactionId
is because btree page deletion happens in VACUUM, which does not have
its own XID; this is unlike the situation for SPGiST where creation of
redirects is caused by index tuple insertion, so there is a surrounding
transaction with a real XID.  But it's not clear to me how
GetConflictingVirtualXIDs makes use of the limitXmin and whether a live
XID is okay to pass to it, or whether we actually need next XID - 1.
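
To make the question concrete, here is a rough sketch of the two candidate limit values and of how a standby-side conflict test might use them. Everything in it is an assumption made for illustration: limit_from_next_xid and snapshot_conflicts are hypothetical names, the rule "a snapshot conflicts unless its xmin is strictly newer than the limit" is a guess at the semantics being asked about rather than a statement of what GetConflictingVirtualXIDs does, and special XIDs and wraparound at the reserved range are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Circular comparison: true if a is logically newer than b (special XIDs ignored). */
static bool
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* btree-style limit as described in the text: next-to-be-assigned XID minus one. */
static TransactionId
limit_from_next_xid(TransactionId next_xid)
{
    return next_xid - 1;
}

/*
 * ASSUMED conflict rule (not confirmed by the source): a standby snapshot
 * conflicts with the recycling record unless its xmin is strictly newer
 * than the limit XID carried in the WAL record.
 */
static bool
snapshot_conflicts(TransactionId snapshot_xmin, TransactionId limit_xmin)
{
    return !xid_follows(snapshot_xmin, limit_xmin);
}
```

Under that assumed rule, a standby snapshot with xmin 960 would not conflict with a limit equal to a creating XID of 950, but would conflict with a btree-style limit of 999 (next XID 1000 minus one). Which of the two behaviours is actually required is exactly the open question.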

Info appreciated.

regards, tom lane
