Re: [HACKERS] Gin page deletion bug

2013-11-08 Thread Heikki Linnakangas

On 07.11.2013 22:49, Heikki Linnakangas wrote:

Gin page deletion fails to take into account that there might be a
search in-flight to the page that is deleted. If the page is reused for
something else, the search can get very confused.

...

The regular b-tree code solves this by stamping deleted pages with the
current XID, and only allowing them to be reused once that XID becomes
old enough ( RecentGlobalXmin). Another approach might be to grab a
cleanup-strength lock on the left and parent pages when deleting a page,
and requiring search to keep the pin on the page its coming from, until
it has locked the next page.


I came up with the attached fix. In a nutshell, when walking along a 
right-link, the new page is locked before releasing the lock on the old 
one. Also, never delete the leftmost branch of a posting tree. I believe 
these changes are sufficient to fix the problem, because of the way the 
posting tree is searched:


All searches on a posting tree are full searches, scanning the whole 
tree from left to right. Insertions do seek to the middle of the tree, 
but they are locked out during vacuum by holding a cleanup lock on the 
posting tree root (even without this patch). Thanks to this property, a 
search cannot step on a page that's being deleted by following a 
downlink, if we refrain from deleting the leftmost page on a level, as 
searches only descend down the leftmost branch.


It would be nice to improve that in master - holding an exclusive lock 
on the root page is pretty horrible - but this is a nice back-patchable 
patch. I'm not worried about the loss of concurrency because we now have 
to hold the lock on previous page when stepping to next page. In 
searches, it's only a share-lock, and in insertions, it's very rare that 
you have to step right.


The patch also adds some sanity checks to stepping right: the next page 
should be of the same type as the current page, e.g stepping right 
should not go from leaf to non-leaf or vice versa.


- Heikki
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 2a6be4b..a738d80 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -112,10 +112,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
 /* rightmost page */
 break;
 
+			stack-buffer = ginStepRight(stack-buffer, btree-index, access);
 			stack-blkno = rightlink;
-			LockBuffer(stack-buffer, GIN_UNLOCK);
-			stack-buffer = ReleaseAndReadBuffer(stack-buffer, btree-index, stack-blkno);
-			LockBuffer(stack-buffer, access);
 			page = BufferGetPage(stack-buffer);
 		}
 
@@ -148,6 +146,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
 	}
 }
 
+/*
+ * Step right from current page.
+ *
+ * The next page is locked first, before releasing the current page. This is
+ * crucial to protect from concurrent page deletion (see comment in
+ * ginDeletePage).
+ */
+Buffer
+ginStepRight(Buffer buffer, Relation index, int lockmode)
+{
+	Buffer		nextbuffer;
+	Page		page = BufferGetPage(buffer);
+	bool		isLeaf = GinPageIsLeaf(page);
+	bool		isData = GinPageIsData(page);
+	BlockNumber	blkno = GinPageGetOpaque(page)-rightlink;
+
+	nextbuffer = ReadBuffer(index, blkno);
+	LockBuffer(nextbuffer, lockmode);
+	UnlockReleaseBuffer(buffer);
+
+	/* Sanity check that the page we stepped to is of similar kind. */
+	page = BufferGetPage(nextbuffer);
+	if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page))
+		elog(ERROR, right sibling of GIN page is of different type);
+
+	/*
+	 * Given the proper lock sequence above, we should never land on a
+	 * deleted page.
+	 */
+	if  (GinPageIsDeleted(page))
+		elog(ERROR, right sibling of GIN page was deleted);
+
+	return nextbuffer;
+}
+
 void
 freeGinBtreeStack(GinBtreeStack *stack)
 {
@@ -235,12 +268,12 @@ ginFindParents(GinBtree btree, GinBtreeStack *stack,
 		while ((offset = btree-findChildPtr(btree, page, stack-blkno, InvalidOffsetNumber)) == InvalidOffsetNumber)
 		{
 			blkno = GinPageGetOpaque(page)-rightlink;
-			LockBuffer(buffer, GIN_UNLOCK);
-			ReleaseBuffer(buffer);
 			if (blkno == InvalidBlockNumber)
+			{
+UnlockReleaseBuffer(buffer);
 break;
-			buffer = ReadBuffer(btree-index, blkno);
-			LockBuffer(buffer, GIN_EXCLUSIVE);
+			}
+			buffer = ginStepRight(buffer, btree-index, GIN_EXCLUSIVE);
 			page = BufferGetPage(buffer);
 		}
 
@@ -439,23 +472,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 		{
 			BlockNumber rightlink = GinPageGetOpaque(page)-rightlink;
 
-			LockBuffer(parent-buffer, GIN_UNLOCK);
-
 			if (rightlink == InvalidBlockNumber)
 			{
 /*
  * rightmost page, but we don't find parent, we should use
  * plain search...
  */
+LockBuffer(parent-buffer, GIN_UNLOCK);
 ginFindParents(btree, stack, rootBlkno);
 parent = stack-parent;
 Assert(parent != NULL);
 break;
 			}
 
+			parent-buffer = ginStepRight(parent-buffer, btree-index, GIN_EXCLUSIVE);
 			

Re: [HACKERS] Gin page deletion bug

2013-11-08 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 I came up with the attached fix. In a nutshell, when walking along a 
 right-link, the new page is locked before releasing the lock on the old 
 one. Also, never delete the leftmost branch of a posting tree. I believe 
 these changes are sufficient to fix the problem, because of the way the 
 posting tree is searched:

This seems reasonable, but I wonder whether the concurrency discussion
shouldn't be in gist/README, rather than buried in a comment inside
ginDeletePage (not exactly the first place one would think to look IMO).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Gin page deletion bug

2013-11-08 Thread Heikki Linnakangas

On 08.11.2013 19:14, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

I came up with the attached fix. In a nutshell, when walking along a
right-link, the new page is locked before releasing the lock on the old
one. Also, never delete the leftmost branch of a posting tree. I believe
these changes are sufficient to fix the problem, because of the way the
posting tree is searched:


This seems reasonable, but I wonder whether the concurrency discussion
shouldn't be in gist/README, rather than buried in a comment inside
ginDeletePage (not exactly the first place one would think to look IMO).


Yeah, README is probably better. There's zero explanation of concurrency 
and locking issues there at the moment, so I added a new section there 
explaining why the page deletion is (now) safe. There's a lot more to 
say about locking in GIN, but I'll leave that for another day.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Gin page deletion bug

2013-11-07 Thread Heikki Linnakangas
Gin page deletion fails to take into account that there might be a 
search in-flight to the page that is deleted. If the page is reused for 
something else, the search can get very confused.


That's pretty difficult to reproduce in a real system, as the window 
between releasing a lock on page and following its right-link is very 
tight, but by setting a breakpoint with a debugger it's easy. Here's how 
I reproduced it:


---
1. Put a breakpoint or sleep in entryGetNextItem() function, where it 
has released lock on one page and is about to read the next one. I used 
this patch:


--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -574,6 +574,9 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
return;
}

+   elog(NOTICE, about to move right to page %u, blkno);
+   sleep(5);
+
entry-buffer = ReleaseAndReadBuffer(entry-buffer,
   
  ginstate-index,

 blkno);

2. Initialize a page with a gin index in suitable state:

create extension btree_gin;
create table foo (i int4);
create index i_gin_foo on foo using gin (i) with (fastupdate = off);

insert into foo select 1 from generate_series(1, 5000);
insert into foo select 2 from generate_series(1, 5000);
set enable_bitmapscan=off; set enable_seqscan=on;
delete from foo where i = 1;

3. Start a query, it will sleep between every page:

set enable_bitmapscan=on; set enable_seqscan=off;
select * from foo where i = 1;
postgres=# select * from foo where i = 1;
NOTICE:  about to move right to page 3
NOTICE:  about to move right to page 5
...

4. In another session, delete and reuse the pages:

vacuum foo;
insert into foo select 2 from generate_series(1, 1) g

5. Let the query run to completion. It will return a lot of tuples with 
i=2, which should not have matched:


...
NOTICE:  about to move right to page 24
NOTICE:  about to move right to page 25
 i
---
 2
 2
 2
...

---

The regular b-tree code solves this by stamping deleted pages with the 
current XID, and only allowing them to be reused once that XID becomes 
old enough ( RecentGlobalXmin). Another approach might be to grab a 
cleanup-strength lock on the left and parent pages when deleting a page, 
and requiring search to keep the pin on the page its coming from, until 
it has locked the next page.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers