Re: Optimize single tuple fetch from nbtree index

2019-08-27 Thread Floris Van Nee

>> It seems that this contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use case.
>> Maybe this restriction can be relaxed for write-only tables that never
>> have to reread the page because of visibility, or something like that.
>> We could also add information about the expected number of tuples to
>> IndexScanDescData, to allow the index to work more optimally
>> and avoid the overhead for other workloads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page for when
> the tuple can no longer be found at the same location on the page. Similarly,
> it'll look at the page to the right if it detects a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage this
> approach gives does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we 
returned to the caller after _bt_first actually gets deleted (not moved) from 
the page? I guess that can theoretically happen if _bt_first returns a 
non-visible tuple (but not DEAD yet in the index at the time of _bt_first). For 
my understanding, would a situation like the following lead to this (in my 
patch)?
1) Backend 1 does an index scan and returns the first tuple on _bt_first - this 
tuple is actually deleted in the heap already, however it's not marked dead yet 
in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the 
tuple is actually dead
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and 
_bt_next), backend 2 comes in and somehow manages both to mark tuples dead on 
the page via _bt_killitems and to compact the items on the page, thereby 
actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to 
locate the tuple where it left off, but cannot find it anymore because it got 
removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard 
to reproduce, as a lot of things need to happen concurrently while doing 
a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+   /* TODO Is it possible that currPage is not valid anymore? */
+   Assert(BTScanPosIsValid(so->currPos))

This Assert already exists a couple of lines earlier, at the start of this 
function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:
 * We note the buffer's block number so that we can release the pin later.
 * This allows us to re-read the buffer if it is needed again for hinting.
 */
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to InvalidBuffer when we 
release the pin via the unpin macro. So, if we don't store the currPage number 
somewhere else, we cannot obtain the pin again if we need it during killitems. 
I think that's the reason that currPage is stored.
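
For illustration, a rough sketch (paraphrased, not the verbatim nbtree source;
the real code uses its own wrappers and rechecks the page before hinting) of
how that re-pin by block number would look:

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"

/*
 * Sketch: re-pin the leaf page for LP_DEAD hinting after the pin was dropped.
 * The unpin macro released the pin AND reset so->currPos.buf to InvalidBuffer,
 * so the remembered block number (currPage) is the only handle left for
 * getting back to the page.
 */
static void
repin_page_for_hinting(IndexScanDesc scan, BTScanOpaque so)
{
    if (!BTScanPosIsPinned(so->currPos))
    {
        Buffer      buf = ReadBuffer(scan->indexRelation, so->currPos.currPage);

        LockBuffer(buf, BT_READ);

        /*
         * The real _bt_killitems() also verifies that the page still matches
         * what was read earlier before setting any LP_DEAD hints; that check
         * and the hinting itself are omitted here.
         */
        UnlockReleaseBuffer(buf);
    }
}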

Other than the two TODOs in the code, I think the comments really help 
clarify what's going on in the code - I'd be happy if this gets added.

-Floris




Re: Optimize single tuple fetch from nbtree index

2019-08-26 Thread Floris Van Nee

> It seems that this contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use case.
> Maybe this restriction can be relaxed for write-only tables that never
> have to reread the page because of visibility, or something like that.
> We could also add information about the expected number of tuples to
> IndexScanDescData, to allow the index to work more optimally
> and avoid the overhead for other workloads.

The idea of the patch is exactly to relax this limitation. I forgot to update 
that README file though. The current implementation of the patch should be 
correct like this - that's why I added the look-back code on the page for when 
the tuple can no longer be found at the same location on the page. Similarly, 
it'll look at the page to the right if it detects a page split. These two 
measures combined should give a correct implementation of the 'it's possible 
that a scan stops in the middle of a page' relaxation. However, as Peter and 
Tom pointed out earlier, they feel that the performance advantage this 
approach gives does not outweigh the extra complexity at this time. I'd be 
open to other suggestions though.

> That's true. It took me quite some time to understand that the existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same as BufferIsPinned in the ScanPos context.
> The attached patch contains some comment updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris




Re: Optimize single tuple fetch from nbtree index

2019-08-26 Thread Anastasia Lubennikova

25.08.2019 0:59, Floris Van Nee wrote:

Though, I got interested in the comment inconsistency you have found.
I added a debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin 
continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I 
wanted to check whether the pin was indeed continuously held by the backend or 
not. I added some debug info to ReleaseBuffer for this: it turned out that the 
pin on the buffer was definitely never released by the backend between the 
calls to _bt_first and _bt_next. So the buffer got compacted while the backend 
held a pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in 
nbtinsert.c. This function compacts a single page without taking a 
super-exclusive lock. It is used during inserts to make room on a page. I 
verified that if I comment out the calls to this function, the compacting never 
happens while I have a pin on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is 
one of the cases where compacting may happen even while other backends hold a 
pin on the buffer. Perhaps this should also be more clearly phrased in the 
comments in e.g. _bt_killitems? Because currently those comments make it look 
like this case never occurs.


You're right, the pin was not released between page reads.
I also added debug output to UnpinBuffer, but now I see that I had interpreted 
it incorrectly.


As far as I understand, the issue with your patch is that it breaks the 
*scan stops "between" pages* assumption, and thus it interacts unsafely with 
_bt_vacuum_one_page() cleanup.

See README:
>Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage.  The heap tuple IDs are then processed while
not holding any page lock within the index.  We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below).  In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice.

and

>Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.

It seems that this contradicts the very idea of your patch, so probably we 
should look for other ways to optimize this use case.
Maybe this restriction can be relaxed for write-only tables that never 
have to reread the page because of visibility, or something like that.
We could also add information about the expected number of tuples to 
IndexScanDescData, to allow the index to work more optimally 
and avoid the overhead for other workloads.


While reading the code to answer your question, I noticed that the
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And the comment on BufferIsValid in bufmgr.h explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading. It clearly does something other than what its 
name suggests. However, I don't believe this introduces problems in these 
particular pieces of code, as long as the macros are always used. 
BTScanPosIsPinned actually checks whether the buffer is valid and not 
necessarily whether it's pinned, as you mentioned. However, any time the buffer 
gets unpinned using the macro BTScanPosUnpin, the buffer gets set to 
InvalidBuffer by the macro as well. Therefore, any subsequent call to 
BTScanPosIsPinned should indeed return false. It'd definitely be nice if this 
gets clarified in comments though.


That's true. It took me quite some time to understand that the existing code 
is correct.
There is a comment for the structure's field that claims that 
BufferIsValid is the same as BufferIsPinned in the ScanPos context.
The attached patch contains some comment updates. Any suggestions on how 
to improve them are welcome.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..316a42d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1751,6 +1751,8 @@ _bt_killitems(IndexScanDesc 

Re: Optimize single tuple fetch from nbtree index

2019-08-24 Thread Floris Van Nee

> Hello,
> welcome to hackers with your first patch)

Thank you.

> Though, I got interested in the comment inconsistency you have found.
> I added a debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin 
continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I 
wanted to check whether the pin was indeed continuously held by the backend or 
not. I added some debug info to ReleaseBuffer for this: it turned out that the 
pin on the buffer was definitely never released by the backend between the 
calls to _bt_first and _bt_next. So the buffer got compacted while the backend 
held a pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in 
nbtinsert.c. This function compacts a single page without taking a 
super-exclusive lock. It is used during inserts to make room on a page. I 
verified that if I comment out the calls to this function, the compacting never 
happens while I have a pin on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is 
one of the cases where compacting may happen even while other backends hold a 
pin on the buffer. Perhaps this should also be more clearly phrased in the 
comments in e.g. _bt_killitems? Because currently those comments make it look 
like this case never occurs.
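
For reference, here is a rough paraphrase (not the actual nbtinsert.c source) of
the kind of cleanup being described: an insertion that needs room on a full
page can remove LP_DEAD items while holding only an exclusive lock, so other
backends may still hold pins on the page while its line pointer array gets
compacted.

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufpage.h"

/*
 * Paraphrased sketch of a microvacuum during insertion: collect the line
 * pointers already hinted LP_DEAD and delete them in one go.
 */
static void
microvacuum_one_page(Page page)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    OffsetNumber deletable[MaxOffsetNumber];
    int         ndeletable = 0;
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            deletable[ndeletable++] = offnum;
    }

    if (ndeletable > 0)
    {
        /*
         * This compacts the line pointer array: items to the right of a
         * deleted one shift left.  A concurrent scan that only holds a pin
         * and remembered an offset can therefore find its item at a smaller
         * offset afterwards.
         */
        PageIndexMultiDelete(page, deletable, ndeletable);

        /* the real code also WAL-logs the deletion */
    }
}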

> While reading the code to answer your question, I noticed that the
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And the comment on BufferIsValid in bufmgr.h explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading. It clearly does something other than what its 
name suggests. However, I don't believe this introduces problems in these 
particular pieces of code, as long as the macros are always used. 
BTScanPosIsPinned actually checks whether the buffer is valid and not 
necessarily whether it's pinned, as you mentioned. However, any time the buffer 
gets unpinned using the macro BTScanPosUnpin, the buffer gets set to 
InvalidBuffer by the macro as well. Therefore, any subsequent call to 
BTScanPosIsPinned should indeed return false. It'd definitely be nice if this 
gets clarified in comments though.
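
For illustration, a paraphrase of the macros in question (renamed with a
SKETCH_ prefix to make clear this is not the verbatim nbtree.h source), showing
why "valid" can stand in for "pinned" as long as every unpin goes through the
macro:

#include "postgres.h"

#include "storage/bufmgr.h"

/* "Is pinned" is really "is valid"... */
#define SKETCH_BTScanPosIsPinned(scanpos) \
    BufferIsValid((scanpos).buf)

/*
 * ...which only stays truthful because unpinning also resets buf, so a
 * position unpinned through this macro can never look pinned afterwards.
 */
#define SKETCH_BTScanPosUnpin(scanpos) \
    do { \
        ReleaseBuffer((scanpos).buf); \
        (scanpos).buf = InvalidBuffer; \
    } while (0)

#define SKETCH_BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (SKETCH_BTScanPosIsPinned(scanpos)) \
            SKETCH_BTScanPosUnpin(scanpos); \
    } while (0)

The invariant only breaks if some code path drops the pin without going through
the macro, which is exactly the kind of subtle bug the renaming/commenting
discussion is about.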

-Floris




Re: Optimize single tuple fetch from nbtree index

2019-08-23 Thread Peter Geoghegan
On Fri, Aug 23, 2019 at 10:14 AM Anastasia Lubennikova wrote:
> Though, I got interested in the comment inconsistency you have found.
> I added a debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

See commit 2ed5b87f. The code is supposed to do that, but it might do
it more often than is truly necessary. We don't want to block VACUUM
by holding a buffer pin for a very long time, which is theoretically
possible here. Do you think that it is actually unnecessary here? In
other words, do you think that we can fix this without breaking cases
that commit 2ed5b87f cares about?

I have been suspicious of this commit all along. For example, I
noticed that it can cause the kill_prior_tuple mechanism to be
ineffective in a way that didn't happen prior to Postgres 9.5:

https://postgr.es/m/CAH2-Wz=sfakvmv1x9jh19ej8am8tzn9f-yecips9hrrrqss...@mail.gmail.com

That particular complaint was never addressed. I meant to do more on
commit 2ed5b87f.

> I didn't dig into the code, but this line looks suspicious (see my
> findings about BTScanPosIsPinned below):
>
>  /* bump pin on current buffer for assignment to mark buffer */
>  if (BTScanPosIsPinned(so->currPos))
>  IncrBufferRefCount(so->currPos.buf);
>
>
> While reading the code to answer your question, I noticed that the
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And the comment on BufferIsValid in bufmgr.h explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I have always hated this macro. I think that code like the specific
code you quoted might be correct, kind of, but it looks like the
author was trying to change as little as possible about the code as it
existed in 2015, rather than changing things so that everything made
sense. It looks like a messy accretion.

Let me see if I can get it straight:

We're incrementing the ref count on the buffer if and only if it is
pinned (by which we mean valid), though only when the scan is valid
(which is not the same as pinned). Whether or not we increment the
count of a valid scan doesn't affect anything else we do (i.e. we
still restore a marked position either way).

This is just awful.
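
For reference, the pattern being restated looks roughly like this (a paraphrase
built around the snippet quoted above; the surrounding context is reconstructed
for illustration and is not verbatim nbtree source):

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufmgr.h"

static void
save_mark_position(BTScanOpaque so)
{
    /* only a valid scan position can be saved as the mark */
    if (BTScanPosIsValid(so->currPos))
    {
        /* bump pin on current buffer for assignment to mark buffer */
        if (BTScanPosIsPinned(so->currPos))
            IncrBufferRefCount(so->currPos.buf);

        /* the saved copy now owns its own pin -- if there was one at all */
        memcpy(&so->markPos, &so->currPos, sizeof(BTScanPosData));
    }
}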

> I propose that we update the BTScanPosIsPinned macro, or at least write a
> comment explaining why its current behavior is fine.
> There are a few existing callers that clearly expect that
> this macro checks a pin, which it doesn't do.
> I don't quite understand whether that already causes any subtle bug, or
> whether the current algorithm is fine.

I think that you're right -- at a minimum, this requires more
documentation. This code is a few years old, but I still wouldn't be
surprised if it turned out to be slightly wrong in a way that was
important. We still have no way of detecting if a buffer is accessed
without a pin. There have been numerous bugs like that before. (We
have talked about teaching Valgrind to detect the case, but that never
actually happened.)

--
Peter Geoghegan




Re: Optimize single tuple fetch from nbtree index

2019-08-23 Thread Anastasia Lubennikova

05.08.2019 14:34, Floris Van Nee wrote:

I have one further question about these index offsets. There are several 
comments in master that indicate that it's impossible that an item moves 'left' 
on a page, if we continuously hold a pin on the page. For example, 
_bt_killitems has a comment like this:
  
 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset.  (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.)  This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.
  
Still, exactly this case happens in practice. In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have 
occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find that we can now find this tuple at indexOffset=98, i.e. it moved 
left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I 
added code in my patch to also look left on the page from the previous 
indexOffset.

However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part 
of Postgres that might move index items left even though there are others 
holding a pin?


Hello,
welcome to hackers with your first patch)

As far as I understood from the thread above, the design of this 
optimization is under discussion, so I didn't review the proposed patch 
itself.

Though, I got interested in the comment inconsistency you have found.
I added a debug message into this code branch of the patch and was able to 
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned 
again between two reads,
and the condition of holding it continuously has not been met.

I didn't dig into the code, but this line looks suspicious (see my 
findings about BTScanPosIsPinned below):


        /* bump pin on current buffer for assignment to mark buffer */
        if (BTScanPosIsPinned(so->currPos))
            IncrBufferRefCount(so->currPos.buf);


While reading the code to answer your question, I noticed that the 
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And the comment on BufferIsValid in bufmgr.h explicitly states that it 
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().

I propose that we update the BTScanPosIsPinned macro, or at least write a 
comment explaining why its current behavior is fine.
There are a few existing callers that clearly expect that 
this macro checks a pin, which it doesn't do.
I don't quite understand whether that already causes any subtle bug, or 
whether the current algorithm is fine.


Peter, Tom, what do you think?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Optimize single tuple fetch from nbtree index

2019-08-05 Thread Floris Van Nee
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.

> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. 
I'm a bit puzzled as to why this case wouldn't come up more often for other 
users though. We have many large tables with timeseries data, and it seems to me 
that with timeseries data, two of the most common queries are:
(1) What is the state of { a1, a2, a3 ... } at timepoint t (but you don't know 
that there's an update *exactly* at timepoint t - so you're left with trying to 
find the latest update smaller than t)
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }
Here a1, a2, a3 ... are independently updating but similar time series (e.g. 
sensor a1 and sensor a2 both provide a temperature value but update 
independently of each other).
Both of these can also be done with some kind of DISTINCT ON clause, but that 
is often already much slower than just doing a nested loop of fast index 
lookups with LIMIT 1 (this depends on the frequency of the timeseries data 
itself versus the sampling rate of your query though; for high-frequency time 
series and/or low-frequency sampling the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan 
thread [1] a function called _bt_read_closest was developed which also 
partially reads the page. A Skip Scan has a very similar access pattern to the 
use case I describe here, because it's also very likely to just require one 
tuple from the page. Even though the implementation in that patch is currently 
incorrect, performance of the Skip Scan would likely also be quite a bit faster 
if it had a correct implementation of this partial page-read and it wouldn't 
have to read the full page every time.

I have one further question about these index offsets. There are several 
comments in master that indicate that it's impossible that an item moves 'left' 
on a page, if we continuously hold a pin on the page. For example, 
_bt_killitems has a comment like this:
 
 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset.  (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.)  This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.
 
Still, exactly this case happens in practice. In my tests I was able to get 
behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have 
occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find that we can now find this tuple at indexOffset=98, i.e. it moved 
left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I 
added code in my patch to also look left on the page from the previous 
indexOffset.

However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part 
of Postgres that might move index items left even though there are others 
holding a pin?
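
To illustrate the kind of look-left logic described above, here is a
hypothetical sketch (an illustration of the idea, not the code from the actual
patch): instead of trusting the remembered offset, the scan searches leftwards
from it for the remembered heap TID.

#include "postgres.h"

#include "access/nbtree.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"

/*
 * Hypothetical: the remembered offset may now point past the tuple returned
 * last, because a microvacuum during an insert can compact the line pointer
 * array and shift items to the left even while we hold a pin.
 */
static OffsetNumber
relocate_prev_tuple(Page page, OffsetNumber rememberedOffset,
                    ItemPointer rememberedHeapTid)
{
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;

    for (off = Min(rememberedOffset, maxoff);
         off >= P_FIRSTDATAKEY(opaque);
         off = OffsetNumberPrev(off))
    {
        ItemId      itemid = PageGetItemId(page, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

        /* offsets may have shifted; the heap TID still identifies the tuple */
        if (ItemPointerEquals(&itup->t_tid, rememberedHeapTid))
            return off;         /* found where we left off */
    }

    return InvalidOffsetNumber; /* the item was removed from the page entirely */
}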

-Floris

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169




Re: Optimize single tuple fetch from nbtree index

2019-08-04 Thread Michail Nikolaev
Hello everyone.

I am also was looking into possibility of such optimisation few days ago
(attempt to reduce memcpy overhead on IndexOnlyScan).

One thing I noticed here - the whole page is scanned only if the index quals
are "open" at some side.

So, in the case of
 SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC
LIMIT 1;
the whole index page will be read.

But
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp AND ts>=:timestamp -
:interval ORDER BY k, ts DESC LIMIT 1;
is semantically the same, while only the few records within :interval will be
processed.

So, you could try comparing such a query in your benchmarks.

Also, some info about the current design is contained in
src/backend/access/nbtree/README ("To minimize lock/unlock traffic, an
index scan always searches a leaf page
to identify all the matching items at once").

Thanks,
 Michail.


Re: Optimize single tuple fetch from nbtree index

2019-08-02 Thread Peter Geoghegan
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan  wrote:
> I wonder if some variety of block nested loop join would be helpful
> here. I'm not aware of any specific design that would help with
> Floris' case, but the idea of reducing the number of scans required on
> the inner side by buffering outer side tuples (say based on the
> "k=:val" constant) seems like it might generalize well enough. I
> suggest Floris look into that possibility. This paper might be worth a
> read:
>
> https://dl.acm.org/citation.cfm?id=582278

Actually, having looked at the test case in more detail, that now
seems less likely. The test case seems designed to reward making it
cheaper to access one specific tuple among a fairly large group of
related tuples -- reducing the number of inner scans is not going to
be possible there.

If this really is totally representative of the case that Floris cares
about, I suppose that the approach taken more or less makes sense.
Unfortunately, it doesn't seem like an optimization that many other
users would find compelling, partly because it's only concerned with
fixed overheads, and partly because most queries don't actually look
like this.

-- 
Peter Geoghegan




Re: Optimize single tuple fetch from nbtree index

2019-08-02 Thread Peter Geoghegan
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane  wrote:
> Meh.  I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller.  There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

I agree that this sounds very scary.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

I wonder if some variety of block nested loop join would be helpful
here. I'm not aware of any specific design that would help with
Floris' case, but the idea of reducing the number of scans required on
the inner side by buffering outer side tuples (say based on the
"k=:val" constant) seems like it might generalize well enough. I
suggest Floris look into that possibility. This paper might be worth a
read:

https://dl.acm.org/citation.cfm?id=582278

(Though it also might not be worth a read -- I haven't actually read it myself.)

--
Peter Geoghegan




Re: Optimize single tuple fetch from nbtree index

2019-08-02 Thread Floris Van Nee
Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1?  That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice.  That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same at first. Note that I did measure a degradation of 2-3% on 
some cases, as mentioned, but initially I also expected worse. Do you have any 
ideas on cases that would suffer the most? I thought the tight inner nested 
loop that I posted in my performance tests would have this index lookup as the 
bottleneck. I know they are the bottleneck for the LIMIT 1 query (because these 
improve by a factor of 2-3 with the patch). And my theory is that for a 
LIMIT 3, the price paid for this optimization is highest, because it would 
touch the page twice and read all items from it, while only returning three of 
them.

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting.  But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well, and I would be willing to implement such a thing 
if this is deemed interesting enough by the community. However, I didn't want 
to do this for the first version of this patch, as it would be quite some extra 
work, which would be wasted if the idea of the patch itself gets rejected 
anyway. :-) I'd appreciate any pointers in the right direction - I can take a 
look at how top-N sorting pushes the LIMIT down. Given enough interest in the 
basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, 
>> descend the tree to a leaf page and read just the first (or possibly two) 
>> tuples. It won't touch the rest of the page yet. If indeed just one tuple 
>> was required, there won't be a call to _bt_next and we're done. If we do 
>> need more than one tuple, _bt_next will resume reading tuples from the index 
>> page at the point where we left off.

> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any 
assumptions about this and just assumes the best-case scenario that everything 
is visible. I'm not sure if it's possible to give an estimate of this and 
whether or not that would be useful. Currently, if it turns out that the tuple 
is not visible, there'll just be another call to _bt_next, which will resume 
reading the page as normal. I'm open to implementing any suggestions that may 
improve this.

>> - We need to take into account page splits, insertions and vacuums while we 
>> do not have the read-lock in between _bt_first and the first call to 
>> _bt_next. This made my patch quite a bit more complicated than my initial 
>> implementation.

> Meh.  I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller.  There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here 
already, to get some input from the experts on this topic about what 
assumptions can be made now and in the future. If it turns out that it's not 
feasible at all to do an optimization like this, because of other constraints 
in the btree implementation, then we're done pretty quickly here. :-) For what 
it's worth: the patch at least passes make check consistently - I caught a lot 
of these edge cases related to page splits and insertions while running the 
regression tests, which run the modified bits of code quite often and in 
parallel. There may be plenty of edge cases left however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. 
Cases with a LIMIT seem common enough to me to make it possible to add some 
extra optimizations, especially if that could lead to 2-3x the TPS for these 
kinds of queries. However, it indeed needs to stay within reasonable 
complexity. If it turns out that optimizing this requires adding a lot of extra 
complexity, it may not be worth it.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is 

Re: Optimize single tuple fetch from nbtree index

2019-08-02 Thread Tom Lane
Floris Van Nee  writes:
> Every time the index scan is done, all tuples from the leaf page are
> read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an
> exception for this *only* the first time amgettuple gets called.

Regardless of whether there's actually a LIMIT 1?  That seems disastrous
for every other case than the narrow one where the optimization wins.
Because every other case is now going to have to touch the index page
twice.  That's more CPU and about double the contention --- if you could
not measure any degradation from that, you're not measuring the right
thing.

In principle, you could pass down knowledge of whether there's a LIMIT,
using the same mechanism used to enable top-N sorting.  But it'd have to
also go through the AM interface layer, so I'm not sure how messy this
would be.
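
For context, here is a hypothetical sketch of what "passing it down" could look
like. The executor already pushes bounds into Sort nodes via ExecSetTupleBound()
for top-N sorting; the struct and function below, and the idea of storing such
a hint for the index AM, are invented for illustration and do not exist in
PostgreSQL:

#include "postgres.h"

/*
 * Hypothetical: a LIMIT-style bound that could be handed through the AM
 * interface so that nbtree knows how many tuples the caller expects.
 */
typedef struct IndexScanBoundHint
{
    bool        bounded;            /* is a bound known at all? */
    int64       tuples_expected;    /* invented: how many tuples are wanted */
} IndexScanBoundHint;

/* Hypothetical analogue of ExecSetTupleBound() for an index scan. */
static void
set_indexscan_bound(IndexScanBoundHint *hint, int64 tuples_needed)
{
    hint->bounded = (tuples_needed >= 0);
    hint->tuples_expected = tuples_needed;

    /*
     * An nbtree scan seeing bounded && tuples_expected == 1 could then choose
     * not to read the whole leaf page up front in _bt_readpage().
     */
}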

> This calls _bt_first in nbtsearch.c, which will, if there are scankeys, 
> descend the tree to a leaf page and read just the first (or possibly two) 
> tuples. It won't touch the rest of the page yet. If indeed just one tuple was 
> required, there won't be a call to _bt_next and we're done. If we do need 
> more than one tuple, _bt_next will resume reading tuples from the index page 
> at the point where we left off.

How do you know how many index entries you have to fetch to get a tuple
that's live/visible to the query?

> - We need to take into account page splits, insertions and vacuums while we 
> do not have the read-lock in between _bt_first and the first call to 
> _bt_next. This made my patch quite a bit more complicated than my initial 
> implementation.

Meh.  I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller.  There's too much that
can happen if you're not holding any lock --- and there's a lot of active
work on btree indexes, which could break whatever assumptions you might
make today.

I'm not unalterably opposed to doing something like this, but my sense
is that the complexity and probable negative performance impact on other
cases are not going to look like a good trade-off for optimizing the
case at hand.

BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other
plan shape that isn't a nestloop around a LIMIT query would be a better
solution.

regards, tom lane