Re: [HACKERS] B-tree descend for insertion locking

2014-09-03 Thread Peter Geoghegan
On Tue, Mar 18, 2014 at 4:12 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 When inserting into a B-tree index, all the pages are read-locked when
 descending the tree. When we reach the leaf page, the read-lock is exchanged
 for a write-lock.

 There's nothing wrong with that, but why don't we just directly grab a
 write-lock on the leaf page?


Whatever happened to this idea?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-tree descend for insertion locking

2014-03-19 Thread Simon Riggs
On 18 March 2014 11:12, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 When inserting into a B-tree index, all the pages are read-locked when
 descending the tree. When we reach the leaf page, the read-lock is exchanged
 for a write-lock.

 There's nothing wrong with that, but why don't we just directly grab a
 write-lock on the leaf page? When descending, we know the level we're on,
 and what level the child page is. The only downside I can see is that we
 would unnecessarily hold a write-lock when a read-lock would suffice, if the
 page was just split and we have to move right. But that seems like a really
 bad bet - hitting the page when it was just split is highly unlikely.

Sounds good.

Grabbing write lock directly will reduce contention on the buffer, not
just reduce the code path.

If we have a considerable number of duplicates we would normally step
thru until we found a place to insert. Presumably that will happen
with write lock enabled, rather than read lock. Would this slow down
the insertion of highly duplicate keys under concurrent load? i.e. is
this a benefit for nearly-unique but not for other cases?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] B-tree descend for insertion locking

2014-03-18 Thread Heikki Linnakangas
When inserting into a B-tree index, all the pages are read-locked when 
descending the tree. When we reach the leaf page, the read-lock is 
exchanged for a write-lock.


There's nothing wrong with that, but why don't we just directly grab a 
write-lock on the leaf page? When descending, we know the level we're 
on, and what level the child page is. The only downside I can see is 
that we would unnecessarily hold a write-lock when a read-lock would 
suffice, if the page was just split and we have to move right. But that 
seems like a really bad bet - hitting the page when it was just split is 
highly unlikely.


Grabbing the write-lock directly would obviously save one buffer 
unlock+lock sequence, for what it's worth, but I think it would also 
slightly simplify the code. Am I missing something?


See attached patch. The new contract of _bt_getroot is a bit weird: it 
locks the returned page in the requested lock-mode, shared or exclusive, 
if the root page was also the leaf page. Otherwise it's locked in shared 
mode regardless off the requested lock mode. But OTOH, the new contract 
for _bt_search() is more clear now: it actually locks the returned page 
in the requested mode, where it used to only use the access parameter to 
decide whether to create a root page if the index was empty.


- Heikki
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5f7953f..3cb6404 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -113,24 +113,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	itup_scankey = _bt_mkscankey(rel, itup);
 
 top:
-	/* find the first page containing this key */
+	/* find the first page containing this key, and write-lock it */
 	stack = _bt_search(rel, natts, itup_scankey, false, buf, BT_WRITE);
 
 	offset = InvalidOffsetNumber;
 
-	/* trade in our read lock for a write lock */
-	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	LockBuffer(buf, BT_WRITE);
-
-	/*
-	 * If the page was split between the time that we surrendered our read
-	 * lock and acquired our write lock, then this page may no longer be the
-	 * right place for the key we want to insert.  In this case, we need to
-	 * move right in the tree.	See Lehman and Yao for an excruciatingly
-	 * precise description.
-	 */
-	buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
-
 	/*
 	 * If we're not allowing duplicates, make sure the key isn't already in
 	 * the index.
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 14e422a..d6a4933 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -78,12 +78,13 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
  *		standard class of race conditions exists here; I think I covered
  *		them all in the Hopi Indian rain dance of lock requests below.
  *
- *		The access type parameter (BT_READ or BT_WRITE) controls whether
- *		a new root page will be created or not.  If access = BT_READ,
- *		and no root page exists, we just return InvalidBuffer.	For
- *		BT_WRITE, we try to create the root page if it doesn't exist.
- *		NOTE that the returned root page will have only a read lock set
- *		on it even if access = BT_WRITE!
+ *		The access type parameter (BT_READ or BT_WRITE) indicates whether
+ *		we're fetching the root for a search or an insertion.  If access =
+ *		BT_READ, and no root page exists, we just return InvalidBuffer.
+ *		For BT_WRITE, we try to create the root page if it doesn't exist.
+ *		If the root page is also a leaf-page, it is locked in the mode
+ *		specified.  NOTE that otherwise the returned root page will have
+ *		only a read lock on it,	even if access = BT_WRITE!
  *
  *		The returned page is not necessarily the true root --- it could be
  *		a fast root (a page that is alone in its level due to deletions).
@@ -127,7 +128,8 @@ _bt_getroot(Relation rel, int access)
 		Assert(rootblkno != P_NONE);
 		rootlevel = metad-btm_fastlevel;
 
-		rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+		rootbuf = _bt_getbuf(rel, rootblkno,
+			 (rootlevel == 0) ? access : BT_READ);
 		rootpage = BufferGetPage(rootbuf);
 		rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
 
@@ -253,14 +255,6 @@ _bt_getroot(Relation rel, int access)
 
 		END_CRIT_SECTION();
 
-		/*
-		 * swap root write lock for read lock.	There is no danger of anyone
-		 * else accessing the new root page while it's unlocked, since no one
-		 * else knows where it is yet.
-		 */
-		LockBuffer(rootbuf, BUFFER_LOCK_UNLOCK);
-		LockBuffer(rootbuf, BT_READ);
-
 		/* okay, metadata is correct, release lock on it */
 		_bt_relbuf(rel, metabuf);
 	}
@@ -285,7 +279,8 @@ _bt_getroot(Relation rel, int access)
 
 		for (;;)
 		{
-			rootbuf = _bt_relandgetbuf(rel, rootbuf, rootblkno, BT_READ);
+			rootbuf = _bt_relandgetbuf(rel, rootbuf, rootblkno,
+	   (rootlevel == 0) ? access : BT_READ);
 			rootpage = 

Re: [HACKERS] B-tree descend for insertion locking

2014-03-18 Thread Peter Geoghegan
On Tue, Mar 18, 2014 at 4:12 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 See attached patch. The new contract of _bt_getroot is a bit weird: it locks
 the returned page in the requested lock-mode, shared or exclusive, if the
 root page was also the leaf page. Otherwise it's locked in shared mode
 regardless off the requested lock mode. But OTOH, the new contract for
 _bt_search() is more clear now: it actually locks the returned page in the
 requested mode, where it used to only use the access parameter to decide
 whether to create a root page if the index was empty.

I had considered this myself, and am under the impression that the
only reason things work this way is because it makes the interface of
_bt_getroot() clearer. I think what you propose is logical, and you
should pursue it, but fwiw I'm not convinced that you've really
simplified _bt_search(). However, you have kind of made _bt_search()
into something that succinctly represents what is useful about Lehman
and Yao as compared to earlier concurrent locking techniques, and
that's a good thing. I would probably have remarked on that in the
comments if I were you. I certainly would not have left out some
reference to LY. You've removed an existing one where the lock
escalation occurs, emphasizing that it's safe, and I'd like to see you
at least compensate for that.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-tree descend for insertion locking

2014-03-18 Thread Amit Kapila
On Tue, Mar 18, 2014 at 4:42 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 When inserting into a B-tree index, all the pages are read-locked when
 descending the tree. When we reach the leaf page, the read-lock is exchanged
 for a write-lock.

 There's nothing wrong with that, but why don't we just directly grab a
 write-lock on the leaf page? When descending, we know the level we're on,
 and what level the child page is. The only downside I can see is that we
 would unnecessarily hold a write-lock when a read-lock would suffice, if the
 page was just split and we have to move right. But that seems like a really
 bad bet - hitting the page when it was just split is highly unlikely.

Another case could be when the page is half dead or deleted, but again
chances of same are relatively less.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers