Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-11-24 Thread Heikki Linnakangas

On 09/27/2014 09:36 AM, Peter Geoghegan wrote:

On Fri, Sep 26, 2014 at 11:34 PM, Amit Kapila amit.kapil...@gmail.com wrote:

I have observed that this patch is in 'Needs Review' state for
next CF.  Do you expect any further review from my side?  I think
we can use text recommended by Heikki and after that if you
feel something more is still required, then you can update the same
and send an updated patch.  I believe it should be in 'Waiting On Author'
state, please do let me know if you feel otherwise.


I don't think so. Thanks for the review!


Ok, applied those extra paragraphs now, and marked as committed in the 
commitfest.


- Heikki





Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-11-24 Thread Peter Geoghegan
On Mon, Nov 24, 2014 at 3:51 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Ok, applied those extra paragraphs now, and marked as committed in the
 commitfest.

Thanks!

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-27 Thread Amit Kapila
On Fri, Sep 12, 2014 at 10:54 PM, Peter Geoghegan p...@heroku.com wrote:

 It isn't. It's a minor point, originally raised by Amit.

I have observed that this patch is in 'Needs Review' state for
next CF.  Do you expect any further review from my side?  I think
we can use text recommended by Heikki and after that if you
feel something more is still required, then you can update the same
and send an updated patch.  I believe it should be in 'Waiting On Author'
state, please do let me know if you feel otherwise.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-27 Thread Peter Geoghegan
On Fri, Sep 26, 2014 at 11:34 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 I have observed that this patch is in 'Needs Review' state for
 next CF.  Do you expect any further review from my side?  I think
 we can use text recommended by Heikki and after that if you
 feel something more is still required, then you can update the same
 and send an updated patch.  I believe it should be in 'Waiting On Author'
 state, please do let me know if you feel otherwise.

I don't think so. Thanks for the review!

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Heikki Linnakangas

On 09/11/2014 11:47 PM, Peter Geoghegan wrote:

On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).


This is just duplicating the existing paragraph. I don't see the point of
this.


The point is to make clear the reason for the differences - evidently,
Amit felt it was unclear why we don't follow L&Y. I am suggesting that
L&Y's talk of having no read locks is unrealistic (it might be
realistic in the 21st century, but that's a whole other story).


IMHO the existing paragraph does a much better job explaining that:


Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared.  Postgres shares in-memory buffers
among backends.  As a result, we do page-level read locking on btree
pages in order to guarantee that no record is modified while we are
examining it.


Amit: did you notice that paragraph in the README? If not, and now that 
you have read it, does that paragraph make things clear? If that's not 
enough, what do you think is missing?
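
For what it's worth, that read-locking discipline boils down to
something like the toy sketch below, with a pthread rwlock standing in
for the buffer content lock.  All names here are made up for
illustration; none of this is the real PostgreSQL buffer-manager API.

/* Toy sketch: page-level read locking on a shared, in-memory page. */
#include <pthread.h>
#include <stdio.h>

#define PAGE_NKEYS 4

static int shared_page[PAGE_NKEYS] = {10, 20, 30, 40};
static pthread_rwlock_t content_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Readers share the lock, so many backends can examine the page at
 * once, but no record can change underneath them while they look. */
static int
page_contains(int key)
{
    int     found = 0;

    pthread_rwlock_rdlock(&content_lock);
    for (int i = 0; i < PAGE_NKEYS; i++)
        if (shared_page[i] == key)
            found = 1;
    pthread_rwlock_unlock(&content_lock);
    return found;
}

/* A writer takes the lock exclusively before modifying the page. */
static void
page_update(int slot, int key)
{
    pthread_rwlock_wrlock(&content_lock);
    shared_page[slot] = key;
    pthread_rwlock_unlock(&content_lock);
}

int
main(void)
{
    page_update(3, 45);
    printf("page contains 45? %d\n", page_contains(45));
    return 0;
}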


- Heikki





Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 2:15 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Amit: did you notice that paragraph in the README? If not, and now that you
 have read it, does that paragraph make things clear? If that's not enough,
 what do you think is missing?

Amit raised the fact that L&Y say that no read locks are required, so
clearly he read that paragraph. We don't try and justify that right
now, and trying to explain the whole point of L&Y without first
explaining this satisfactorily seems odd. The current text reads:


Lehman and Yao don't require read locks, but assume that in-memory
copies of tree pages are unshared.


If they assume that they're unshared, uh, then why bother with all
that locking stuff? Or does this mean that page reads and writes are
(dubiously) assumed atomic without locks? If you take a step back, you
can see how confusing that is. L&Y don't get around to explaining
this, but it's pretty clear that this is what's going on. If L&Y did a
better job of explaining their algorithm, they'd just say that the
latch coupling (coupling, or crabbing, of what we'd call buffer
locks) previously required is no longer required, but single shared
locks are required. As I pointed out, it looks like someone figured
out a way to make that true much, much later [1]. I think page-level
MVCC designs might have also been used, but basically our
interpretation of L&Y is the standard one. We cannot really be
considered to have deviated from L&Y, since AFAICT everyone else went
with the same interpretation.

FWIW, in recent years Stonebraker has used latch coupling as an
example of the (implicitly no longer acceptable) overhead of designs
influenced by System R and ARIES. This is unfounded, but plenty of
bright people still at least believe that latch coupling is common.

[1] http://www.vldb.org/conf/2001/P181.pdf
-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Robert Haas
On Fri, Sep 12, 2014 at 12:57 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 12, 2014 at 2:15 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Amit: did you notice that paragraph in the README? If not, and now that you
 have read it, does that paragraph make things clear? If that's not enough,
 what do you think is missing?

 Amit raised the fact that L&Y say that no read locks are required, so
 clearly he read that paragraph. We don't try and justify that right
 now, and trying to explain the whole point of L&Y without first
 explaining this satisfactorily seems odd. The current text reads:

 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.
 

 If they assume that they're unshared, uh, then why bother with all
 that locking stuff? Or does this mean that page reads and writes are
 (dubiously) assumed atomic without locks? If you take a step back, you
 can see how confusing that is. L&Y don't get around to explaining
 this, but it's pretty clear that this is what's going on. If L&Y did a
 better job of explaining their algorithm, they'd just say that the
 latch coupling (coupling, or crabbing, of what we'd call buffer
 locks) previously required is no longer required, but single shared
 locks are required. As I pointed out, it looks like someone figured
 out a way to make that true much, much later [1]. I think page-level
 MVCC designs might have also been used, but basically our
 interpretation of L&Y is the standard one. We cannot really be
 considered to have deviated from L&Y, since AFAICT everyone else went
 with the same interpretation.

Gosh, I think you're making this way more complicated than it needs to
be.  My interpretation of the above statement was that they knew
individual page reads and writes would need to be made atomic -
probably using some form of simple locking - but omitted that from
their pseudocode for clarity.  Such elisions are common in computer
science literature and don't typically detract from understanding.  It
is assumed that the implementor knows enough to avoid the trivial
pitfalls of whatever is being discussed and therefore focuses on the
higher-level algorithmic issues.

If this is what we're arguing about, it's completely not worth the
time we've spent on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 10:10 AM, Robert Haas robertmh...@gmail.com wrote:
 Gosh, I think you're making this way more complicated than it needs to
 be.  My interpretation of the above statement was that they knew
 individual page reads and writes would need to be made atomic -
 probably using some form of simple locking - but omitted that from
 their pseudocode for clarity.

That clearly isn't the case. The introductory paragraph of L&Y says
the following:

Our solution compares favorably with earlier solutions in that the
locking scheme is simpler (no read-locks are used) and only a (small)
constant number of nodes are locked by any update process at any given
time.

They clearly and prominently state that not needing read locks is a
major advantage of their algorithm, which doesn't quite ring true.

 If this is what we're arguing about, it's completely not worth the
 time we've spent on it.

It isn't. It's a minor point, originally raised by Amit.

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Kevin Grittner
Peter Geoghegan p...@heroku.com wrote:

 The introductory paragraph of L&Y says the following:

 Our solution compares favorably with earlier solutions in that the
 locking scheme is simpler (no read-locks are used) and only a (small)
 constant number of nodes are locked by any update process at any given
 time.

 They clearly and prominently state that not needing read locks is a
 major advantage of their algorithm, which doesn't quite ring true.

It's been a while since I read that paper, but my recollection is
that they assumed that each process or thread looking at a buffer
would have its own private copy of that buffer, which it could be
sure nobody was changing (even if the master copy somewhere else
was changing).  Locking was only needed to prevent conflicting
writes.  Now, whether it is safe to assume that creating a
process-local buffer and copying to it is cheaper than getting a
lock seems dicey, but that seemed to be the implicit assumption.
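
As a toy illustration of that assumption (all names hypothetical, and
not anything Postgres actually does), the reader works on a private
snapshot of the page and needs no read lock while examining it; only
conflicting writers are serialized:

/* Toy sketch of the "private copy" reading discipline. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define PAGE_NKEYS 4

typedef struct ToyPage
{
    int     keys[PAGE_NKEYS];
} ToyPage;

static ToyPage shared_page = {{10, 20, 30, 40}};    /* master copy */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Reader: copy the master page into a private buffer, then examine the
 * copy with no lock held -- nobody else can change the private copy.
 * (Whether the copy itself is atomic with respect to writers is exactly
 * the part that seems to be assumed rather than spelled out.) */
static int
reader_contains(int key)
{
    ToyPage local;

    memcpy(&local, &shared_page, sizeof(local));    /* private snapshot */
    for (int i = 0; i < PAGE_NKEYS; i++)
        if (local.keys[i] == key)
            return 1;
    return 0;
}

/* Writer: only conflicting writes are serialized. */
static void
writer_update(int slot, int key)
{
    pthread_mutex_lock(&write_lock);
    shared_page.keys[slot] = key;
    pthread_mutex_unlock(&write_lock);
}

int
main(void)
{
    writer_update(3, 45);
    printf("reader sees 45? %d\n", reader_contains(45));
    return 0;
}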

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Peter Geoghegan
On Fri, Sep 12, 2014 at 12:39 PM, Kevin Grittner kgri...@ymail.com wrote:
 It's been a while since I read that paper, but my recollection is
 that they assumed that each process or thread looking at a buffer
 would have its own private copy of that buffer, which it could be
 sure nobody was changing (even if the master copy somewhere else
 was changing).  Locking was only needed to prevent conflicting
 writes.  Now, whether it is safe to assume that creating a
 process-local buffer and copying to it is cheaper than getting a
 lock seems dicey, but that seemed to be the implicit assumption.

That is one way to make reads atomic, but I don't recall any explicit
mention of it. In 1981, I think page sizes were about the same as
today, but 4K was a lot of memory. We could actually do this, with
some work. I think that this has actually been implemented elsewhere,
though. Note that L&Y have practically nothing to say about deletion -
they simply suggest that it be done offline.

It is really useful that we can recover from page splits as and when
problems arise. That's really what I'd like to prominently convey - it
is the whole point of L&Y.

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-12 Thread Amit Kapila
On Fri, Sep 12, 2014 at 2:45 PM, Heikki Linnakangas hlinnakan...@vmware.com
wrote:

 On 09/11/2014 11:47 PM, Peter Geoghegan wrote:

 On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:

 + Although it could be argued that Lehman and Yao isn't followed to the
 + letter because single pages are read locked as the tree is descended,
 + this is at least necessary to support deletion, a common requirement
  + which L&Y hardly acknowledge.  Read locks also ensure that B-tree
  + pages are self-consistent (L&Y appear to assume atomic page reads and
 + writes).


 This is just duplicating the existing paragraph. I don't see the point
of
 this.


 The point is to make clear the reason for the differences - evidently,
 Amit felt it was unclear why we don't follow L&Y. I am suggesting that
 L&Y's talk of having no read locks is unrealistic (it might be
 realistic in the 21st century, but that's a whole other story).


 IMHO the existing paragraph does a much better job explaining that:

 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
 among backends.  As a result, we do page-level read locking on btree
 pages in order to guarantee that no record is modified while we are
 examining it.


 Amit: did you notice that paragraph in the README?

Yeah, I have noticed and even discussed in this thread that
it might be better to add new text after that paragraph.

 If not, and now that you have read it, does that paragraph make things
clear? If that's not enough, what do you think is missing?

The paragraph is quite clear to me, and the only thing I have
mentioned above is that if you (Peter) want to add anything
new in that regard, it is about an additional point on why
read locks are taken during traversal of the tree, which is something
he is trying to say in his new patch as below:
"this is at least necessary to support deletion,"

However, it could have been added in the existing text as well.

Having said that, I think that was just a very minor thing which is
generally obvious and was not at all the main feedback for the patch.
The main point was to explain how right-links at all levels and the
high key help to detect and recover from concurrent page splits, which
I think the patch has tried to explain, but honestly the wording you
have posted (under the heading "The basic Lehman & Yao Algorithm") is
clearer to me.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-11 Thread Peter Geoghegan
On Tue, Sep 9, 2014 at 12:01 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 + Lehman and Yao don't require read locks, but assume that in-memory
 + copies of tree pages are unshared.

 This is the existing paragraph, just moved to a different place in the README.

That's right - it seemed to make just as much sense to talk about that
here, while doing so establishes the context of talking about how what
we do differs from the canonical algorithm (which I think is true of
real-world L&Y implementations generally). Which makes the next
paragraph easier to understand:

 + Although it could be argued that Lehman and Yao isn't followed to the
 + letter because single pages are read locked as the tree is descended,
 + this is at least necessary to support deletion, a common requirement
 + which L&Y hardly acknowledge.  Read locks also ensure that B-tree
 + pages are self-consistent (L&Y appear to assume atomic page reads and
 + writes).

 This is just duplicating the existing paragraph. I don't see the point of
 this.

The point is to make clear the reason for the differences - evidently,
Amit felt it was unclear why we don't follow L&Y. I am suggesting that
L&Y's talk of having no read locks is unrealistic (it might be
realistic in the 21st century, but that's a whole other story).

  Even with these read locks, following L&Y obviates the need
 + of earlier schemes to hold multiple locks concurrently when descending
 + the tree as part of servicing index scans (pessimistic lock coupling).
 + The addition of right-links at all levels, as well as the addition of
 + a page high key allows detection and dynamic recovery from
 + concurrent page splits (that is, splits between unlocking an internal
 + page, and subsequently locking its child page during a descent).

 This explains in a few sentences what a L&Y B-tree looks like. The current
 README assumes that you're already familiar with what a L&Y tree looks like,
 or that you go read the paper mentioned at the top of the README. It might
 be a good idea to expand on that, and add an introduction like this so that
 an unfamiliar reader doesn't need to read the LY paper first. Is that the
 purpose of this patch? Please make it more explicit.

Yes, although even L&Y don't get around to telling the reader why they
should care, the mistake that many good papers make. We now know its
significance, both in general and to Postgres. Framing the discussion
like this aids understanding more than you'd think.

 And please make the
 sentences simpler - the above reads like a Shakespeare play.

Out, damned lock!

 Something like:
 The basic Lehman & Yao Algorithm
 ================================

 Compared to a classic B-tree, L&Y adds a right-link pointer to each page, to
 the page's right sibling. It also adds a high key to each page, which is
 an upper bound on the keys that are allowed on that page. These two
 additions make it possible to detect a concurrent page split, which allows the
 tree to be searched without holding any read locks (except to keep a single
 page from being modified while reading it).

 When a search follows a downlink to a child page, it compares the page's
 high key with the search key. If the search key is greater than the high
 key, the page must've been split concurrently, and you must follow the
 right-link to find the new page containing the key range you're looking for.
 This might need to be repeated, if the page has been split more than once.

 Differences to the Lehman & Yao algorithm
 =========================================

 (current "Lehman and Yao Algorithm and Insertions" section)



 I think that's pretty much the same information you added, but it's in a
 separate section, with the clear purpose that it explains what a L&Y tree
 looks like. You can skip over it, if you have read the paper or are
 otherwise already familiar with it. It still assumes that you're familiar
 with B-trees in general.

That seems fair enough - I'd just expand on why we don't completely
avoid read locks, which L&Y suppose we can get away with. That is
clearly a point of confusion.

 Anyway, I see that you had resurrected this in the commitfest app after
 three weeks of inactivity. I'm going to mark this back to Returned with
 Feedback. Please don't resurrect it again, this patch has received more
 than its fair share of attention.

I didn't mean to suggest that it deserves much attention. I didn't
know how to interpret the fact that you changed the status, since in
fact not much additional work was required. I was busy throughout, for
reasons that are perhaps obvious. I am not fussed about when this
happens, but I really think we should get around to it.

 Instead, please help by signing up to
 review a patch. The commitfest progress is glacial at the moment, and we
 really need experienced reviewers like you to get closure to people's
 patches.

I'm back from holidays now. I plan to look at Tomas Vondra's
patch. If you think I should look at some particular 

Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-09 Thread Heikki Linnakangas

On 09/07/2014 05:58 AM, Peter Geoghegan wrote:

+ Lehman and Yao don't require read locks, but assume that in-memory
+ copies of tree pages are unshared.  Postgres shares in-memory buffers
+ among backends.  As a result, we do page-level read locking on btree
+ pages in order to guarantee that no record is modified while we are
+ examining it.  This reduces concurrency but guarantees correct
+ behavior.  An advantage is that when trading in a read lock for a
+ write lock, we need not re-read the page after getting the write lock.
+ Since we're also holding a pin on the shared buffer containing the
+ page, we know that buffer still contains the page and is up-to-date.


This is the existing paragraph, just moved to a different place in the README.


+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).


This is just duplicating the existing paragraph. I don't see the point 
of this.



  Even with these read locks, following L&Y obviates the need
+ of earlier schemes to hold multiple locks concurrently when descending
+ the tree as part of servicing index scans (pessimistic lock coupling).
+ The addition of right-links at all levels, as well as the addition of
+ a page high key allows detection and dynamic recovery from
+ concurrent page splits (that is, splits between unlocking an internal
+ page, and subsequently locking its child page during a descent).  When
+ a page is first locked (at every level of a descent servicing an index
+ scan), we consider the need to move right:  if the scankey value is
+ greater than (or sometimes greater than or equal to) the page's existing
+ highkey value, a value which serves as an upper bound for values on
+ the page generally, then it must be necessary to move the scan to the
+ right-hand page on the same level.  It's even possible that the scan
+ needs to move right more than once.  Once the other session's
+ concurrent page split finishes, a downlink will be inserted into the
+ parent, and so assuming there are no further page splits, future index
+ scans using the same scankey value will not need to move right.  L&Y
+ Trees are sometimes referred to as B-Link trees in the literature.


This explains in a few sentences what a L&Y B-tree looks like. The 
current README assumes that you're already familiar with what a L&Y tree 
looks like, or that you go read the paper mentioned at the top of the 
README. It might be a good idea to expand on that, and add an 
introduction like this so that an unfamiliar reader doesn't need to read 
the LY paper first. Is that the purpose of this patch? Please make it 
more explicit. And please make the sentences simpler - the above reads 
like a Shakespeare play. Something like:


The basic Lehman & Yao Algorithm
================================

Compared to a classic B-tree, L&Y adds a right-link pointer to each 
page, to the page's right sibling. It also adds a high key to each 
page, which is an upper bound on the keys that are allowed on that page. 
These two additions make it possible to detect a concurrent page split, 
which allows the tree to be searched without holding any read locks 
(except to keep a single page from being modified while reading it).


When a search follows a downlink to a child page, it compares the page's 
high key with the search key. If the search key is greater than the high 
key, the page must've been split concurrently, and you must follow the 
right-link to find the new page containing the key range you're looking 
for. This might need to be repeated, if the page has been split more 
than once.


Differences to the Lehman & Yao algorithm
=========================================

(current "Lehman and Yao Algorithm and Insertions" section)



I think that's pretty much the same information you added, but it's in a 
separate section, with the clear purpose that it explains what a L&Y 
tree looks like. You can skip over it, if you have read the paper or are 
otherwise already familiar with it. It still assumes that you're 
familiar with B-trees in general.
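
To illustrate, the move-right check described above amounts to 
something like this self-contained toy sketch; the types and in-memory 
layout are invented for illustration and are not the real nbtree 
structures or buffer-manager calls:

/* Toy sketch of the L&Y "move right" check during a descent. */
#include <limits.h>
#include <stdio.h>

typedef struct ToyPage
{
    int     highkey;    /* upper bound for keys here; INT_MAX on the rightmost page */
    int     rightlink;  /* index of right sibling in pages[], or -1 if none */
} ToyPage;

/* The descent has followed a downlink and landed on pages[start].  If
 * that page split concurrently, its high key may now be smaller than
 * the search key, so keep following right-links until the key fits.
 * Only one page is examined (and would be read locked) at a time. */
static int
move_right(const ToyPage *pages, int start, int searchkey)
{
    int     cur = start;

    while (searchkey > pages[cur].highkey && pages[cur].rightlink != -1)
        cur = pages[cur].rightlink;     /* may happen more than once */

    return cur;
}

int
main(void)
{
    /* Page 0 split: keys above 50 moved to its new right sibling, page 1. */
    ToyPage pages[2] = {
        {50, 1},            /* old page: high key 50, right sibling is page 1 */
        {INT_MAX, -1}       /* new right sibling, currently the rightmost page */
    };

    printf("search for 70 ends on page %d\n", move_right(pages, 0, 70));
    printf("search for 30 ends on page %d\n", move_right(pages, 0, 30));
    return 0;
}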


Anyway, I see that you had resurrected this in the commitfest app after 
three weeks of inactivity. I'm going to mark this back to Returned with 
Feedback. Please don't resurrect it again, this patch has received more 
than its fair share of attention. Instead, please help by signing up to 
review a patch. The commitfest progress is glacial at the moment, and we 
really need experienced reviewers like you to get closure to people's 
patches.


- Heikki





Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-09-06 Thread Peter Geoghegan
On Tue, Jul 22, 2014 at 10:49 PM, Peter Geoghegan p...@heroku.com wrote:
  Basically I think it will be better if you can explain in a bit more
  detail how right-links at all levels and the high key help to detect
  and recover from concurrent page splits.

 You might be right about that - perhaps I should go into more detail.

I've gone into more detail on what a high key is, and how we arguably
do not follow Lehman & Yao to the letter because we still have read
locks. L&Y's assumption of atomic page reads/writes, and cursory
handling of deletion kind of make it inevitable that read locks are
used, which I now imply. So nbtree isn't a substandard L&Y
implementation - it's a realistic one, which only needs to hold a
single read lock at a time when servicing index scans (or when finding
a place for insertion). I guess Lehman & Yao preferred to put forward
the claim "no read locks" rather than "only one read lock at a time on
internal pages, even for insertion", because there might be some uses
of their algorithm where that is actually realistic. It is really a
sympathetic way of spinning things to say that *no* read locks are
used, though.

-- 
Peter Geoghegan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
new file mode 100644
index 4820f76..8285050
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
*** use a simplified version of the deletion
*** 11,16 ****
--- 11,50 ----
  Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,
  Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).
  
+ Lehman and Yao don't require read locks, but assume that in-memory
+ copies of tree pages are unshared.  Postgres shares in-memory buffers
+ among backends.  As a result, we do page-level read locking on btree
+ pages in order to guarantee that no record is modified while we are
+ examining it.  This reduces concurrency but guarantees correct
+ behavior.  An advantage is that when trading in a read lock for a
+ write lock, we need not re-read the page after getting the write lock.
+ Since we're also holding a pin on the shared buffer containing the
+ page, we know that buffer still contains the page and is up-to-date.
+ 
+ Although it could be argued that Lehman and Yao isn't followed to the
+ letter because single pages are read locked as the tree is descended,
+ this is at least necessary to support deletion, a common requirement
+ which L&Y hardly acknowledge.  Read locks also ensure that B-tree
+ pages are self-consistent (L&Y appear to assume atomic page reads and
+ writes).  Even with these read locks, following L&Y obviates the need
+ of earlier schemes to hold multiple locks concurrently when descending
+ the tree as part of servicing index scans (pessimistic lock coupling).
+ The addition of right-links at all levels, as well as the addition of
+ a page high key allows detection and dynamic recovery from
+ concurrent page splits (that is, splits between unlocking an internal
+ page, and subsequently locking its child page during a descent).  When
+ a page is first locked (at every level of a descent servicing an index
+ scan), we consider the need to move right:  if the scankey value is
+ greater than (or sometimes greater than or equal to) the page's existing
+ highkey value, a value which serves as an upper bound for values on
+ the page generally, then it must be necessary to move the scan to the
+ right-hand page on the same level.  It's even possible that the scan
+ needs to move right more than once.  Once the other session's
+ concurrent page split finishes, a downlink will be inserted into the
+ parent, and so assuming there are no further page splits, future index
+ scans using the same scankey value will not need to move right.  L&Y
+ Trees are sometimes referred to as B-Link trees in the literature.
+ 
  The Lehman and Yao Algorithm and Insertions
  -------------------------------------------
  
*** to be inserted has a choice whether or n
*** 42,57 ****
  key could go on either page.  (Currently, we try to find a page where
  there is room for the new key without a split.)
  
- Lehman and Yao don't require read locks, but assume that in-memory
- copies of tree pages are unshared.  Postgres shares in-memory buffers
- among backends.  As a result, we do page-level read locking on btree
- pages in order to guarantee that no record is modified while we are
- examining it.  This reduces concurrency but guarantees correct
- behavior.  An advantage is that when trading in a read lock for a
- write lock, we need not re-read the page after getting the write lock.
- Since we're also holding a pin on the shared buffer containing the
- page, we know that buffer still contains the page and is up-to-date.
- 
  We support the notion of an ordered scan of an index as well as
  insertions, deletions, and simple lookups.  A scan in the forward
  direction is no problem, we just use the right-sibling pointers that
--- 

Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-24 Thread Amit Kapila
On Thu, Jul 24, 2014 at 9:28 AM, Peter Geoghegan p...@heroku.com wrote:
 On Wed, Jul 23, 2014 at 8:52 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  As such there is no problem in saying the way you have mentioned, but
  I feel it would be better if we can mention the mechanism of
_bt_search()
  as quoted by you upthread in the first line.
   In more concrete terms, _bt_search() releases and only then acquires
  read locks during a descent of the tree (by calling
  _bt_relandgetbuf()), and, perhaps counterintuitively, that's just
  fine.

 I guess I could say that too.

Okay.

  One more point: why do you think it is important to add this new text
  on top?  I think adding the new text after the "Lehman and Yao don't
  require read locks, .." paragraph is okay.

 I've added it to the top because it's really the most important point
 on Lehman and Yao. It's the _whole_ point. Consider how it's
 introduced here, for example:
 http://db.cs.berkeley.edu/jmh/cs262b/treeCCR.html

 Why should I bury the lead?

I think even if you want to keep it at the top, maybe we could have another
heading like: "Concurrency Considerations with Lehman & Yao Approach".

However, I think we can leave this point for the Committer to decide.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-23 Thread Amit Kapila
On Wed, Jul 23, 2014 at 11:19 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jul 22, 2014 at 8:59 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 Okay, but how does this justify adding the below new text in the README?
  + Even with these read locks, Lehman and Yao's approach obviates the
  + need of earlier schemes to hold multiple read locks concurrently when
  + descending the tree as part of servicing index scans (pessimistic lock
  + coupling).
 
 Actually I think putting it can lead to inconsistency in the README.
 Currently it indicates that our algorithm is different from L&Y w.r.t.
 taking read locks, and has given an explanation for the same.

 Not really. Firstly, that sentence acknowledges that there are read
 locks where L&Y assume there will not be. "Even with these read locks"
 references the first paragraph, where it is stated that Postgres
 B-Trees still acquire read locks while descending the tree.

I think here you want to state that the difference in Postgres is that, as
we are using the L&Y approach, it doesn't need to hold *multiple* read locks
concurrently, and the L&Y approach which obviates this need is explained in
the second line (which indicates the importance of maintaining right-links
and high keys to detect and recover from page splits).

As such there is no problem in saying the way you have mentioned, but
I feel it would be better if we can mention the mechanism of _bt_search()
as quoted by you upthread in the first line.
 In more concrete terms, _bt_search() releases and only then acquires
 read locks during a descent of the tree (by calling
 _bt_relandgetbuf()), and, perhaps counterintuitively, that's just
 fine.

One more point: why do you think it is important to add this new text
on top?  I think adding the new text after the "Lehman and Yao don't
require read locks, .." paragraph is okay.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-23 Thread Peter Geoghegan
On Wed, Jul 23, 2014 at 8:52 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 As such there is no problem in saying the way you have mentioned, but
 I feel it would be better if we can mention the mechanism of _bt_search()
 as quoted by you upthread in the first line.
  In more concrete terms, _bt_search() releases and only then acquires
 read locks during a descent of the tree (by calling
 _bt_relandgetbuf()), and, perhaps counterintuitively, that's just
 fine.

I guess I could say that too.

 One more point: why do you think it is important to add this new text
 on top?  I think adding the new text after the "Lehman and Yao don't
 require read locks, .." paragraph is okay.

I've added it to the top because it's really the most important point
on Lehman and Yao. It's the _whole_ point. Consider how it's
introduced here, for example:
http://db.cs.berkeley.edu/jmh/cs262b/treeCCR.html

Why should I bury the lead?

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-22 Thread Peter Geoghegan
On Mon, Jul 21, 2014 at 9:55 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 There is a mention about the race condition where it needs to move right
 in another caller (_bt_search) of _bt_moveright() as well.

 /*
 * Race -- the page we just grabbed may have split since we read its
 * pointer in the parent (or metapage).  If it has, we may need to
 * move right to its new sibling.  Do that.
 ..

 Do you think there is more to add beyond what is already mentioned on
 top of the second caller, or do you think that if it is true for both,
 it should be on top of _bt_moveright()?

Well, maybe it is justified to mention it multiple times, so as to not
break the reader's train of thought. I'm not sure.

 In general, I agree with you that we should mention any advantage
 of the algorithm we are using, especially if it is significant.  I think
 it will be better if we can also mention how that advantage or use is
 realized in our implementation, as we are already doing in the README
 of nbtree.

Right. It seems like the nbtree README is very shy about telling us
what the point of all this extra work is. IMV that should be stated
very prominently to aid understanding. A lot of people believe that we
have to do lock coupling/crabbing when descending the tree. We do
not. The locks acquired when descending a B-Tree in Postgres are
pretty granular. One read buffer lock is held at a time in the process
of servicing index scans. There are times during the descent of the
tree when no buffer locks are held whatsoever. Moreover, (with some
caveats) it doesn't really matter if a stale view of the page is seen
during a descent (as it happens I've been trying to think of ways to
further take advantage of this). That's pretty cool. If you only know
one thing about how the nbtree code works, that should probably be it.

 The above indicates 2 things:
 a. L&Y doesn't need to hold read locks concurrently.
 b. Advantage of right-links at all levels and high-key.

 As per my understanding, we are not following point (a) in our code,
 so what is the benefit we get by having a reference of same in README?

The major reason that we don't completely avoid read locks is, I
suppose, the need for self-consistent pages (but also because it would
break page deletion - I'm pretty sure that L&Y don't consider page
deletion, and the page deletion logic is entirely based on the Lanin
and Shasha paper and original research, but I didn't check). I think
that the sentence "Lehman and Yao don't require read locks, but assume
that in-memory copies of tree pages are unshared" is intended to
convey the point on the self-consistency of pages. Of course, Lehman
and Yao must assume that the B-Tree is in some sense in shared memory.
Otherwise, there wouldn't be much point to their elaborate locking
protocol.  :-)

 Isn't it better if we mention how the point (b) is used in our code and
 its advantage together rather than keeping it at top of README?

Maybe it deserves more prominent mention in the code too.

 Already the README mentions in brief the right-link and how it is used,
 as below:
 .. The scan must remember the page's right-link at the time it was scanned,
 since that is the page to move right to; if we move right to the current
 right-link then we'd re-scan any items moved by a page split. ...

This is talking about how index scans interlock against VACUUM while
going to the heap, by keeping a leaf page pinned (this prevents super
exclusive lock acquisition). This is after the tree has been
descended. This is really just a detail (albeit one that follows
similar principles, since pages split right and it similarly exploits
that fact), whereas the use of right links and high keys while
descending the tree, and in particular the fact that the move right
LY technique obviates the prior need for lock coupling is pretty
much the whole point of LY.

In more concrete terms, _bt_search() releases and only then acquires
read locks during a descent of the tree (by calling
_bt_relandgetbuf()), and, perhaps counterintuitively, that's just
fine.
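
To make the contrast with lock coupling concrete, here is a toy sketch
of the two descent disciplines.  lock_page()/unlock_page() and the page
numbering are hypothetical stand-ins, not the real buffer-manager calls
that _bt_search() and _bt_relandgetbuf() make:

/* Toy contrast: pessimistic lock coupling vs. the L&Y descent. */
#include <stdio.h>

static void lock_page(int pageno)   { printf("  lock page %d\n", pageno); }
static void unlock_page(int pageno) { printf("  unlock page %d\n", pageno); }

/* Pre-L&Y: keep the parent locked until the child is locked, so that a
 * concurrent split cannot invalidate the downlink just followed. */
static void
descend_with_coupling(const int *path, int nlevels)
{
    lock_page(path[0]);
    for (int i = 1; i < nlevels; i++)
    {
        lock_page(path[i]);             /* two pages locked at once */
        unlock_page(path[i - 1]);
    }
    unlock_page(path[nlevels - 1]);
}

/* L&Y: release the parent before locking the child.  If the child
 * split in the window where no lock was held, the move-right check
 * (high key plus right-link) recovers. */
static void
descend_lehman_yao(const int *path, int nlevels)
{
    for (int i = 0; i < nlevels; i++)
    {
        lock_page(path[i]);             /* at most one page locked at a time */
        /* ... read the downlink, or check the high key and maybe move right ... */
        unlock_page(path[i]);           /* no lock held while stepping to the child */
    }
}

int
main(void)
{
    int     path[] = {1, 4, 9};         /* root -> internal -> leaf */

    printf("lock coupling:\n");
    descend_with_coupling(path, 3);
    printf("Lehman & Yao:\n");
    descend_lehman_yao(path, 3);
    return 0;
}
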
-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-22 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 Right. It seems like the nbtree README is very shy about telling us
 what the point of all this extra work is.

IIRC, the README was written on the assumption that you'd already read
L&Y.  If this patch is mostly about not assuming that, why not?

regards, tom lane




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-22 Thread Amit Kapila
On Wed, Jul 23, 2014 at 5:58 AM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jul 21, 2014 at 9:55 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  The above indicates 2 things:
  a. L&Y doesn't need to hold read locks concurrently.
  b. Advantage of right-links at all levels and high-key.
 
  As per my understanding, we are not following point (a) in our code,
  so what is the benefit we get by having a reference of same in README?

 The major reason that we don't completely avoid read locks is, I
 suppose, the need for self-consistent pages (but also because it would
 break page deletion - I'm pretty sure that L&Y don't consider page
 deletion, and the page deletion logic is entirely based on the Lanin
 and Shasha paper and original research, but I didn't check). I think
 that the sentence "Lehman and Yao don't require read locks, but assume
 that in-memory copies of tree pages are unshared" is intended to
 convey the point on the self-consistency of pages. Of course, Lehman
 and Yao must assume that the B-Tree is in some sense in shared memory.
 Otherwise, there wouldn't be much point to their elaborate locking
 protocol.  :-)

Okay, but how does this justify adding the below new text in the README?
+ Even with these read locks, Lehman and Yao's approach obviates the
+ need of earlier schemes to hold multiple read locks concurrently when
+ descending the tree as part of servicing index scans (pessimistic lock
+ coupling).

Actually I think putting it can lead to inconsistency in the README.
Currently it indicates that our algorithm is different from L&Y w.r.t.
taking read locks, and has given an explanation for the same.

  Isn't it better if we mention how the point (b) is used in our code and
  its advantage together rather than keeping it at top of README?

 Maybe it deserves more prominent mention in the code too.

  Already the README mentions in brief the right-link and how it is used,
  as below:
  .. The scan must remember the page's right-link at the time it was
scanned,
  since that is the page to move right to; if we move right to the current
  right-link then we'd re-scan any items moved by a page split. ...

 This is talking about how index scans interlock against VACUUM while
 going to the heap, by keeping a leaf page pinned (this prevents super
 exclusive lock acquisition). This is after the tree has been
 descended. This is really just a detail (albeit one that follows
 similar principles, since pages split right and it similarly exploits
 that fact), whereas the use of right links and high keys while
 descending the tree, and in particular the fact that the "move right"
 L&Y technique obviates the prior need for lock coupling, is pretty
 much the whole point of L&Y.

 In more concrete terms, _bt_search() releases and only then acquires
 read locks during a descent of the tree (by calling
 _bt_relandgetbuf()), and, perhaps counterintuitively, that's just
 fine.

So don't you think that it needs a bit more explanation than what you
have quoted in the below text?
+ The addition of right-links at all levels, as well as the
+ addition of a page high key allows detection of, and dynamic
+ recovery from concurrent page splits (that is, splits between
+ unlocking an internal page, and subsequently locking its child page
+ during a descent).

Basically I think it will be better if you can explain in a bit more
detail how right-links at all levels and the high key help to detect
and recover from concurrent page splits.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-22 Thread Peter Geoghegan
On Tue, Jul 22, 2014 at 5:39 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 IIRC, the README was written on the assumption that you'd already read
 L&Y.  If this patch is mostly about not assuming that, why not?

L&Y made the same mistake that the authors of most influential papers
make - they never get around to telling the reader why they should
bother to read it. The paper is over 30 years old, and we now know
that it's very influential, and the reasons why. I think that both the
nbtree README and L&Y would be a lot more approachable with a high
level introduction (arguably L&Y attempt this, but the way they go
about it seems impenetrable, mostly consisting of esoteric references
to other papers). Surely making that code more approachable is a
worthy goal.

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-22 Thread Peter Geoghegan
On Tue, Jul 22, 2014 at 8:59 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 Okay, but how does this justify adding the below new text in the README?
 + Even with these read locks, Lehman and Yao's approach obviates the
 + need of earlier schemes to hold multiple read locks concurrently when
 + descending the tree as part of servicing index scans (pessimistic lock
 + coupling).

 Actually I think putting it can lead to inconsistency in the README.
 Currently it indicates that our algorithm is different from L&Y w.r.t.
 taking read locks, and has given an explanation for the same.

Not really. Firstly, that sentence acknowledges that there are read
locks where L&Y assume there will not be. "Even with these read locks"
references the first paragraph, where it is stated that Postgres
B-Trees still acquire read locks while descending the tree. Secondly,
I'm pretty sure that even Lehman and Yao realized that their apparent
assumption that real implementations would not require read locks
isn't realistic. Their handling of deletion seems perfunctory to me.
They say "In situations where excessive deletions cause the storage
utilization of tree nodes to be unacceptably low, a batch
reorganization or an underflow operation which locks the entire tree
can be performed."  I'm pretty sure that that sounded almost as bad in
1980 as it does now. We don't have a "not quite L&Y" implementation
just because there are single read locks acquired while descending the
tree. Prior schemes needed multiple *concurrent* exclusive locks.
B-Trees were around for about 10 years before L&Y.

There is reason to think that pretty much every practical
implementation has used read locks for many years, because there is a
well-received 2001 paper [1] that describes a scheme where L&Y-style
B-link trees can *actually* be made to not require read locks, which
discusses things like caching effects on contemporary hardware - it
involves playing tricks with detecting and recovering from page level
inconsistencies, IIRC. Furthermore, it references a scheme from the
late 90s involving local copies of B-Link pages. I thought about
pursuing something like that myself, but the cost of latching
(buffer locking) B-Trees in PostgreSQL is likely to be reduced before
too long anyway, which makes the general idea seem unenticing right
now.

 Basically I think it will be better if you can explain in a bit more
 detail how right-links at all levels and the high key help to detect
 and recover from concurrent page splits.

You might be right about that - perhaps I should go into more detail.

[1] http://www.vldb.org/conf/2001/P181.pdf
-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-21 Thread Amit Kapila
On Tue, May 27, 2014 at 1:52 AM, Peter Geoghegan p...@heroku.com wrote:

 While talking to various people during pgCon, I was reminded that the
 nbtree README does a poor job of explaining the actual practical
  advantages of L&Y from a high level. The "move right" trick is
 initially mentioned only as an adjunct to a discussion of the
 special-case handling of root page splits, even though it's of huge
 significance. We also say this within nbtinsert.c:

 /*
  * If the page was split between the time that we surrendered our read
  * lock and acquired our write lock, then this page may no longer be the
  * right place for the key we want to insert.  In this case, we need to
  * move right in the tree.  See Lehman and Yao for an excruciatingly
  * precise description.
  */

 (Why we need to say something about _bt_moveright() within
 _bt_doinsert() that is equally true of both _bt_moveright() callers
 isn't clear).

There is a mention about the race condition where it needs to move right
in another caller (_bt_search) of _bt_moveright() as well.

/*
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage).  If it has, we may need to
* move right to its new sibling.  Do that.
..

Do you think there is more to add beyond what is already mentioned on
top of the second caller, or do you think that if it is true for both,
it should be on top of _bt_moveright()?

 I think that this isn't enough. Attached patch proposes to add a small
 paragraph at the top of the nbtree README, to clarify the advantages
  of L&Y from a high level.  I don't think it's appropriate to state the
 advantages of an algorithm in a README file generally, but in this
 instance it makes it easier to understand the algorithm.

In general, I agree with you that we should mention any advantage
of the algorithm we are using, especially if it is significant.  I think
it will be better if we can also mention how that advantage or use is
realized in our implementation, as we are already doing in the README
of nbtree.

+
+ Even with these read locks, Lehman and Yao's approach obviates the
+ need of earlier schemes to hold multiple read locks concurrently when
+ descending the tree as part of servicing index scans (pessimistic lock
+ coupling).  The addition of right-links at all levels, as well as the
+ addition of a page high key allows detection of, and dynamic
+ recovery from concurrent page splits (that is, splits between
+ unlocking an internal page, and subsequently locking its child page
+ during a descent).  L&Y Trees are sometimes referred to as B-Link
+ trees in the literature.
+

The above indicates 2 things:
a. L&Y doesn't need to hold read locks concurrently.
b. Advantage of right-links at all levels and high-key.

As per my understanding, we are not following point (a) in our code,
so what is the benefit we get by having a reference of same in README?

Isn't it better if we mention how the point (b) is used in our code and
its advantage together rather than keeping it at top of README?

Already the README mentions in brief the right-link and how it is used,
as below:
.. The scan must remember the page's right-link at the time it was scanned,
since that is the page to move right to; if we move right to the current
right-link then we'd re-scan any items moved by a page split. ...


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-02 Thread Peter Geoghegan
On Mon, May 26, 2014 at 1:22 PM, Peter Geoghegan p...@heroku.com wrote:
 I think that this isn't enough. Attached patch proposes to add a small
 paragraph at the top of the nbtree README, to clarify the advantages
  of L&Y from a high level.


I've added this to the ongoing commitfest.

-- 
Peter Geoghegan




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-02 Thread Abhijit Menon-Sen
At 2014-07-02 17:24:06 -0700, p...@heroku.com wrote:

 I've added this to the ongoing commitfest.

I don't actually understand why you were able to do that (seeing as this
CF is no longer open for new patches). Trivial or not, I think at this
point it should go into the next one. Or it should be handled outside
the CF altogether.

-- Abhijit




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-02 Thread Michael Paquier
On Thu, Jul 3, 2014 at 11:40 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 At 2014-07-02 17:24:06 -0700, p...@heroku.com wrote:

 I've added this to the ongoing commitfest.

 I don't actually understand why you were able to do that (seeing as this
 CF is no longer open for new patches). Trivial or not, I think at this
 point it should go into the next one. Or it should be handled outside
 the CF altogether.
+1. Even if it is just a doc patch, it has been submitted after 6/15,
which was the CF1 deadline.
-- 
Michael




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-02 Thread Abhijit Menon-Sen
At 2014-07-03 11:59:21 +0900, michael.paqu...@gmail.com wrote:

 +1. Even if it is just a doc patch, it has been submitted after 6/15,
 which was the CF1 deadline.

Well, to be fair, the original patch was posted to the list more than a
month ago, and it should have been in this CF. But… it wasn't. And now
after more than two weeks into this CF, I don't think it should be.

I moved it to 2014-08. (Sorry, Peter.)

-- Abhijit




Re: [HACKERS] Stating the significance of Lehman & Yao in the nbtree README

2014-07-02 Thread Peter Geoghegan
On Wed, Jul 2, 2014 at 8:14 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 Well, to be fair, the original patch was posted to the list more than a
 month ago, and it should have been in this CF. But… it wasn't. And now
 after more than two weeks into this CF, I don't think it should be.

Is that how the rule is interpreted? Okay, I defer to you. I guess
I've just never seen a situation come up before where that distinction
needed to be drawn.

 I moved it to 2014-08. (Sorry, Peter.)

I don't mind if no one looks at this until then. I agree that this is
exactly the kind of thing that generally doesn't need to be handled
through a commitfest submission. The only reason I added this to any
commitfest was to avoid having it be forgotten about entirely, which
there is a real danger of for something like this when it isn't
handled quickly. I almost forgot about it myself.

-- 
Peter Geoghegan

