Re: Optimize single tuple fetch from nbtree index
>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about expected number
>> of tuples, to allow index work more optimal
>> and avoid the overhead for other loads.

> The idea of the patch is exactly to relax this limitation. I forgot to update
> that README file though. The current implementation of the patch should be
> correct like this - that's why I added the look-back code on the page if the
> tuple couldn't be found anymore on the same location on the page. Similarly,
> it'll look on the page to the right if it detected a page split. These two
> measures combined should give a correct implementation of the 'it's possible
> that a scan stops in the middle of a page' relaxation. However, as Peter and
> Tom pointed out earlier, they feel that the performance advantage that this
> approach gives does not outweigh the extra complexity at this time. I'd be
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (one not yet marked DEAD in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?

1) Backend 1 does an index scan and returns the first tuple from _bt_first - this tuple is actually deleted in the heap already, however it's not marked dead yet in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is actually dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to somehow a) do a _bt_killitems on the page to mark tuples dead, as well as b) compact the items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to locate the tuple where it left off, but cannot find it anymore because it got removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard to reproduce, as a lot of things need to happen concurrently while doing a visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:

 * We note the buffer's block number so that we can release the pin later.
 * This allows us to re-read the buffer if it is needed again for hinting.
 */
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

As we figured out earlier, so->currPos.buf gets set to invalid when we release the pin via the unpin macro. So, if we don't store the currPage number somewhere else, we cannot obtain the pin again if we need it during killitems. I think that's the reason that currPage is stored.

Other than the two TODOs in the code, I think the comments really help clarify what's going on in the code - I'd be happy if this gets added.

-Floris
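To make the failure mode concrete, here is a toy Python model (illustrative only - not the nbtree data layout or Postgres code) of a scan that remembers an (offset, heap TID) pair, a concurrent compaction that removes killed items, and the look-back search the patch performs when resuming:

```python
# Toy model of resuming a scan stopped in the middle of a page after a
# concurrent compaction. Items are heap TIDs stored at 1-based offsets.

def compact(page, dead_tids):
    """Remove LP_DEAD items, shifting survivors left (what compaction does)."""
    return [tid for tid in page if tid not in dead_tids]

def resume(page, last_offset, last_tid):
    """Re-find the item we stopped at; look left if it moved, and report
    when it was removed entirely (the problematic case above)."""
    for off in range(min(last_offset, len(page)), 0, -1):
        if page[off - 1] == last_tid:
            return off  # found (possibly moved left)
    return None  # item was deleted from the page: scan lost its place

page = [101, 102, 103, 104, 105]
# Scan stops after returning TID 103 at offset 3; another backend kills
# and compacts TIDs 101 and 102 while we hold only a pin.
page = compact(page, {101, 102})
print(resume(page, 3, 103))  # 1: the item moved left, look-back finds it
print(resume(page, 3, 102))  # None: the returned item itself was removed
```

The second call models exactly the scenario in the numbered steps above: the remembered TID is gone from the page and no amount of searching left can recover the scan position.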
Re: Optimize single tuple fetch from nbtree index
> It seems that it contradicts the very idea of your patch, so probably we
> should look for other ways to optimize this use-case.
> Maybe this restriction can be relaxed for write only tables, that never
> have to reread the page because of visibility, or something like that.
> Also we probably can add to IndexScanDescData info about expected number
> of tuples, to allow index work more optimal
> and avoid the overhead for other loads.

The idea of the patch is exactly to relax this limitation. I forgot to update that README file though. The current implementation of the patch should be correct like this - that's why I added the look-back code on the page if the tuple couldn't be found anymore on the same location on the page. Similarly, it'll look on the page to the right if it detected a page split. These two measures combined should give a correct implementation of the 'it's possible that a scan stops in the middle of a page' relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage that this approach gives does not outweigh the extra complexity at this time. I'd be open to other suggestions though.

> That's true. It took me quite some time to understand that existing code
> is correct.
> There is a comment for the structure's field that claims that
> BufferIsValid is the same that BufferIsPinned in ScanPos context.
> Attached patch contains some comments' updates. Any suggestions on how
> to improve them are welcome.

I'll have a look tomorrow. Thanks a lot for writing this up!

-Floris
Re: Optimize single tuple fetch from nbtree index
25.08.2019 0:59, Floris Van Nee wrote:
>> Though, I got interested in the comment inconsistency you have found.
>> I added debug message into this code branch of the patch and was able to
>> see it in regression.diffs after 'make check':
>> Speaking of your patch, it seems that the buffer was unpinned and pinned
>> again between two reads,
>> and the condition of holding it continuously has not been met.
>
> May I ask what makes you conclude that the condition of holding the pin
> continuously has not been met? Your reply encouraged me to dig a little bit
> more into this today. First, I wanted to check if indeed the pin was
> continuously held by the backend or not. I added some debug info to
> ReleaseBuffer for this: it turned out that the pin on the buffer was
> definitely never released by the backend between the calls to _bt_first
> and _bt_next. So the buffer got compacted while the backend held a pin on
> it. After some more searching I found the following code:
> _bt_vacuum_one_page in nbtinsert.c. This function compacts one single page
> without taking a super-exclusive lock. It is used during inserts to make
> room on a page. I verified that if I comment out the calls to this
> function, the compacting never happens while I have a pin on the buffer.
> So I guess that answers my own question: cleaning up garbage during
> inserts is one of the cases where compacting may happen even while other
> backends hold a pin on the buffer. Perhaps this should also be more
> clearly phrased in the comments in e.g. _bt_killitems? Because currently
> those comments make it look like this case never occurs.

You're right, the pin was not released between page reads. I also added debug to UnpinBuffer, but now I see that I had interpreted it wrongly.

As far as I understand, the issue with your patch is that it breaks the *scan stops "between" pages* assumption and thus it unsafely interacts with _bt_vacuum_one_page() cleanup. See the README:

> Page read locks are held only for as long as a scan is examining a page.
> To minimize lock/unlock traffic, an index scan always searches a leaf page
> to identify all the matching items at once, copying their heap tuple IDs
> into backend-local storage. The heap tuple IDs are then processed while not
> holding any page lock within the index. We do continue to hold a pin on the
> leaf page in some circumstances, to protect against concurrent deletions
> (see below). In this state the scan is effectively stopped "between" pages,
> either before or after the page it has pinned. This is safe in the presence
> of concurrent insertions and even page splits, because items are never moved
> across pre-existing page boundaries --- so the scan cannot miss any items it
> should have seen, nor accidentally return the same item twice.

and

> Once an index tuple has been marked LP_DEAD it can actually be removed from
> the index immediately; since index scans only stop "between" pages, no scan
> can lose its place from such a deletion.

It seems that it contradicts the very idea of your patch, so probably we should look for other ways to optimize this use-case. Maybe this restriction can be relaxed for write-only tables, that never have to reread the page because of visibility, or something like that. Also we probably can add to IndexScanDescData info about the expected number of tuples, to allow the index to work more optimally and avoid the overhead for other loads.

>> While reading the code to answer your question, I noticed that
>> BTScanPosIsPinned macro name is misleading.
>> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
>> And BufferIsValid in bufmgr.h comment explicitly states that it
>> shouldn't be confused with BufferIsPinned.
>> The same goes for BTScanPosUnpinIfPinned().
>
> I agree the name is misleading. It clearly does something else than what
> its name suggests. However, I don't believe this introduces problems in
> these particular pieces of code, as long as the macros are always used.
> BTScanPosIsPinned actually checks whether it's valid and not necessarily
> whether it's pinned, as you mentioned. However, any time the buffer gets
> unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by
> the macro as well. Therefore, any consecutive call to BTScanPosIsPinned
> should indeed return false. It'd definitely be nice if this gets clarified
> in comments though.

That's true. It took me quite some time to understand that the existing code is correct. There is a comment for the structure's field that claims that BufferIsValid is the same as BufferIsPinned in the ScanPos context. The attached patch contains some comment updates. Any suggestions on how to improve them are welcome.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..316a42d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1751,6 +1751,8 @@ _bt_killitems(IndexScanDesc
Re: Optimize single tuple fetch from nbtree index
> Hello,
> welcome to hackers with your first patch)

Thank you.

> Though, I got interested in the comment inconsistency you have found.
> I added debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

May I ask what makes you conclude that the condition of holding the pin continuously has not been met? Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuously held by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.

After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c. This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer. So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin on the buffer. Perhaps this should also be more clearly phrased in the comments in e.g. _bt_killitems? Because currently those comments make it look like this case never occurs.

> While reading the code to answer your question, I noticed that
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And BufferIsValid in bufmgr.h comment explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I agree the name is misleading. It clearly does something else than what its name suggests. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether it's valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutive call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in comments though.

-Floris
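The invariant being described - "invalid buffer" is a safe proxy for "not pinned" only because the unpin macro always invalidates the buffer, while the block number is kept so the page can be re-read later for killed-item hinting - can be sketched with a toy Python model (illustrative names, not the actual C macros):

```python
# Toy model of the BTScanPos pin-tracking convention discussed above.
# None stands in for InvalidBuffer; this is not Postgres code.

INVALID = None

class ScanPos:
    def __init__(self):
        self.buf = INVALID      # pinned buffer, or INVALID
        self.curr_page = None   # block number, remembered across unpin

    def pin_page(self, block_number, buf_id):
        self.buf = buf_id
        # Remember the block number so the pin can be re-acquired later
        # (e.g. for marking killed items), even after buf is invalidated.
        self.curr_page = block_number

    def is_pinned(self):
        # Like BTScanPosIsPinned: really a *validity* check on buf...
        return self.buf is not INVALID

    def unpin(self):
        # ...which only works because unpinning always invalidates buf.
        assert self.is_pinned()
        self.buf = INVALID

pos = ScanPos()
pos.pin_page(block_number=42, buf_id=7)
assert pos.is_pinned()
pos.unpin()
assert not pos.is_pinned()   # validity check now correctly says "unpinned"
assert pos.curr_page == 42   # block number survives, enabling a re-pin
```

The model makes the fragility visible: the validity check is only equivalent to a pin check as long as every unpin path also invalidates the buffer, which is exactly why the macro name invites confusion.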
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 23, 2019 at 10:14 AM Anastasia Lubennikova wrote:
> Though, I got interested in the comment inconsistency you have found.
> I added debug message into this code branch of the patch and was able to
> see it in regression.diffs after 'make check':
> Speaking of your patch, it seems that the buffer was unpinned and pinned
> again between two reads,
> and the condition of holding it continuously has not been met.

See commit 2ed5b87f. The code is supposed to do that, but it might do it more often than is truly necessary. We don't want to block VACUUM by holding a buffer pin for a very long time, which is theoretically possible here. Do you think that it is actually unnecessary here? In other words, do you think that we can fix this without breaking the cases that commit 2ed5b87f cares about?

I have been suspicious of this commit all along. For example, I noticed that it can cause the kill_prior_tuple mechanism to be ineffective in a way that didn't happen prior to Postgres 9.5:

https://postgr.es/m/CAH2-Wz=sfakvmv1x9jh19ej8am8tzn9f-yecips9hrrrqss...@mail.gmail.com

That particular complaint was never addressed. I meant to do more on commit 2ed5b87f.

> I didn't dig into the code, but this line looks suspicious (see my
> findings about BTScanPosIsPinned below):
>
> /* bump pin on current buffer for assignment to mark buffer */
> if (BTScanPosIsPinned(so->currPos))
>     IncrBufferRefCount(so->currPos.buf);
>
> While reading the code to answer your question, I noticed that
> BTScanPosIsPinned macro name is misleading.
> It calls BufferIsValid(), not BufferIsPinned() as one could expect.
> And BufferIsValid in bufmgr.h comment explicitly states that it
> shouldn't be confused with BufferIsPinned.
> The same goes for BTScanPosUnpinIfPinned().

I have always hated this macro. I think that code like the specific code you quoted might be correct, kind of, but it looks like the author was trying to change as little as possible about the code as it existed in 2015, rather than changing things so that everything made sense. It looks like a messy accretion.

Let me see if I can get it straight: we're incrementing the ref count on the buffer if and only if it is pinned (by which we mean valid), though only when the scan is valid (which is not the same as pinned). Whether or not we increment the count of a valid scan doesn't affect anything else we do (i.e. we still restore a marked position either way). This is just awful.

> I propose that we update BTScanPosIsPinned macro. Or, at least write a
> comment, why its current behavior is fine.
> There are a few existing callers, that are definitely expecting that
> this macro checks a pin, which it doesn't do.
> I don't quite understand if that already causes any subtle bug, or the
> current algorithm is fine.

I think that you're right -- at a minimum, this requires more documentation. This code is a few years old, but I still wouldn't be surprised if it turned out to be slightly wrong in a way that was important. We still have no way of detecting if a buffer is accessed without a pin. There have been numerous bugs like that before. (We have talked about teaching Valgrind to detect the case, but that never actually happened.)

--
Peter Geoghegan
Re: Optimize single tuple fetch from nbtree index
05.08.2019 14:34, Floris Van Nee wrote:
> I have one further question about these index offsets. There are several
> comments in master that indicate that it's impossible for an item to move
> 'left' on a page if we continuously hold a pin on the page. For example,
> _bt_killitems has a comment like this:
>
>  * Note that if we hold a pin on the target page continuously from initially
>  * reading the items until applying this function, VACUUM cannot have deleted
>  * any items from the page, and so there is no need to search left from the
>  * recorded offset. (This observation also guarantees that the item is still
>  * the right one to delete, which might otherwise be questionable since heap
>  * TIDs can get recycled.) This holds true even if the page has been modified
>  * by inserts and page splits, so there is no need to consult the LSN.
>
> Still, exactly this case happens in practice. In my tests I was able to get
> behavior like:
> 1) pin + lock a page in _bt_first
> 2) read a tuple, record indexOffset (for example offset=100) and heap tid
> 3) unlock page, but *keep* the pin (end of _bt_first of my patch)
> 4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't
>    have occurred)
> 5) look inside the current page for the heap tid that we registered earlier
> 6) we find that we can now find this tuple at indexOffset=98, i.e. it moved
>    left. This should not be possible.
>
> This case sometimes randomly happens when running 'make check', which is
> why I added code in my patch to also look left on the page from the
> previous indexOffset. However, this is in contradiction with the comments
> (and code) of _bt_killitems. Is the comment incorrect/outdated, or is
> there a bug in vacuum or any other part of Postgres that might move index
> items left even though others are holding a pin?

Hello,
welcome to hackers with your first patch)

As far as I understood from the thread above, the design of this optimization is under discussion, so I didn't review the proposed patch itself. Though, I got interested in the comment inconsistency you have found. I added a debug message into this code branch of the patch and was able to see it in regression.diffs after 'make check'.

Speaking of your patch, it seems that the buffer was unpinned and pinned again between two reads, and the condition of holding it continuously has not been met. I didn't dig into the code, but this line looks suspicious (see my findings about BTScanPosIsPinned below):

/* bump pin on current buffer for assignment to mark buffer */
if (BTScanPosIsPinned(so->currPos))
    IncrBufferRefCount(so->currPos.buf);

While reading the code to answer your question, I noticed that the BTScanPosIsPinned macro name is misleading. It calls BufferIsValid(), not BufferIsPinned() as one could expect. And the BufferIsValid comment in bufmgr.h explicitly states that it shouldn't be confused with BufferIsPinned. The same goes for BTScanPosUnpinIfPinned().

I propose that we update the BTScanPosIsPinned macro. Or, at least, write a comment explaining why its current behavior is fine. There are a few existing callers that are definitely expecting that this macro checks a pin, which it doesn't do. I don't quite understand if that already causes any subtle bug, or the current algorithm is fine. Peter, Tom, what do you think?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Optimize single tuple fetch from nbtree index
Hi Peter,

> Actually, having looked at the test case in more detail, that now
> seems less likely. The test case seems designed to reward making it
> cheaper to access one specific tuple among a fairly large group of
> related tuples -- reducing the number of inner scans is not going to
> be possible there.
> If this really is totally representative of the case that Floris cares
> about, I suppose that the approach taken more or less makes sense.
> Unfortunately, it doesn't seem like an optimization that many other
> users would find compelling, partly because it's only concerned with
> fixed overheads, and partly because most queries don't actually look
> like this.

Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data, two of the most common queries are:

(1) What is the state of { a1, a2, a3, ... } at timepoint t? (But you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t.)
(2) What is the state of { a } at timepoints { t1, t2, t3, ... }?

Here a1, a2, a3, ... are independently updating, but similar, time series (e.g. sensor a1 and sensor a2 both provide a temperature value and update independently of each other). Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling the LIMIT 1 approach is much faster).

Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to require just one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read, and it wouldn't have to read the full page every time.

I have one further question about these index offsets. There are several comments in master that indicate that it's impossible for an item to move 'left' on a page if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:

 * Note that if we hold a pin on the target page continuously from initially
 * reading the items until applying this function, VACUUM cannot have deleted
 * any items from the page, and so there is no need to search left from the
 * recorded offset. (This observation also guarantees that the item is still
 * the right one to delete, which might otherwise be questionable since heap
 * TIDs can get recycled.) This holds true even if the page has been modified
 * by inserts and page splits, so there is no need to consult the LSN.

Still, exactly this case happens in practice. In my tests I was able to get behavior like:

1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap tid that we registered earlier
6) we find that we can now find this tuple at indexOffset=98, i.e. it moved left. This should not be possible.

This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset. However, this is in contradiction with the comments (and code) of _bt_killitems. Is the comment incorrect/outdated, or is there a bug in vacuum or any other part of Postgres that might move index items left even though others are holding a pin?

-Floris

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKW4dXTP9G%2BWBskjT09tzD%2B9aMWEm%3DFpeb6RS5SXfPyKw%40mail.gmail.com#21abe755d5cf36aabaaa048c8a282169
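Returning to timeseries lookup (1) near the top of this mail: the access pattern - the latest update at or before t for each series, one single-tuple fetch per key - can be modeled with a sorted array and binary search. A toy Python sketch (illustrative only; the data and names are made up):

```python
# Toy model of timeseries query (1): the latest update at or before
# timepoint t for each series, i.e. what a nested loop of
# "ORDER BY ts DESC LIMIT 1" index lookups computes per key.

from bisect import bisect_right

# Per-series updates sorted by timestamp, mimicking a (k, ts) btree.
series = {
    'a1': [(1, 20.0), (5, 21.5), (9, 22.0)],   # (ts, temperature)
    'a2': [(2, 18.0), (8, 18.7)],
}

def state_at(series, t):
    """Latest (ts, value) at or before t for every series key."""
    result = {}
    for key, updates in series.items():
        # Rightmost update with ts <= t: a single-tuple fetch per key,
        # analogous to the LIMIT 1 inner index scan described above.
        i = bisect_right([ts for ts, _ in updates], t)
        if i > 0:
            result[key] = updates[i - 1]
    return result

print(state_at(series, 6))  # {'a1': (5, 21.5), 'a2': (2, 18.0)}
```

The point of the optimization under discussion is that each of these per-key lookups needs only one tuple, so materializing the whole leaf page per lookup is wasted work.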
Re: Optimize single tuple fetch from nbtree index
Hello everyone.

I was also looking into the possibility of such an optimisation a few days ago (an attempt to reduce memcpy overhead on an IndexOnlyScan). One thing I noticed here - the whole page is scanned only if the index quals are "opened" at some side. So, in the case of

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;

the whole index page will be read. But

SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp AND ts>=:timestamp - :interval ORDER BY k, ts DESC LIMIT 1;

is semantically the same, but only the few records within :interval will be processed. So, you could try to compare such a query in your benchmarks.

Also, some info about the current design is contained in src/backend/access/nbtree/README ("To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once").

Thanks,
Michail.
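The effect Michail describes - an open-ended qual forces the whole leaf page to be copied, while a bounded qual lets the scan stop early - can be sketched with a toy Python model (illustrative only; the real _bt_readpage/continuescan logic is more involved):

```python
# Toy model of reading one leaf page in descending order: copy matching
# items, stopping as soon as a qual can no longer be satisfied.

def read_page_desc(page_ts, upper, lower=None):
    """Return (matching items, number of items examined)."""
    matches = []
    examined = 0
    for ts in sorted(page_ts, reverse=True):
        examined += 1
        if ts > upper:
            continue               # not yet in range; keep scanning
        if lower is not None and ts < lower:
            break                  # bounded qual fails: stop early
        matches.append(ts)
    return matches, examined

page = list(range(100))            # ts = 0 .. 99 on one leaf page
open_matches, open_examined = read_page_desc(page, upper=50)
bounded_matches, bounded_examined = read_page_desc(page, upper=50, lower=48)
print(open_examined)     # 100: open lower bound examines the whole page
print(bounded_examined)  # 53: stops soon after ts drops below the bound
print(bounded_matches)   # [50, 49, 48]
```

With only the upper bound, every remaining item on the page satisfies the quals and gets copied even though LIMIT 1 will use just one of them; the extra lower bound turns the rest into a qual failure that ends the page read.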
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan wrote:
> I wonder if some variety of block nested loop join would be helpful
> here. I'm not aware of any specific design that would help with
> Floris' case, but the idea of reducing the number of scans required on
> the inner side by buffering outer side tuples (say based on the
> "k=:val" constant) seems like it might generalize well enough. I
> suggest Floris look into that possibility. This paper might be worth a
> read:
>
> https://dl.acm.org/citation.cfm?id=582278

Actually, having looked at the test case in more detail, that now seems less likely. The test case seems designed to reward making it cheaper to access one specific tuple among a fairly large group of related tuples -- reducing the number of inner scans is not going to be possible there.

If this really is totally representative of the case that Floris cares about, I suppose that the approach taken more or less makes sense. Unfortunately, it doesn't seem like an optimization that many other users would find compelling, partly because it's only concerned with fixed overheads, and partly because most queries don't actually look like this.

--
Peter Geoghegan
Re: Optimize single tuple fetch from nbtree index
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane wrote:
> Meh. I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller. There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

I agree that this sounds very scary.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

I wonder if some variety of block nested loop join would be helpful here. I'm not aware of any specific design that would help with Floris' case, but the idea of reducing the number of scans required on the inner side by buffering outer side tuples (say based on the "k=:val" constant) seems like it might generalize well enough. I suggest Floris look into that possibility. This paper might be worth a read:

https://dl.acm.org/citation.cfm?id=582278

(Though it also might not be worth a read -- I haven't actually read it myself.)

--
Peter Geoghegan
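The buffering idea Peter sketches - fewer inner-side scans by batching outer tuples - can be illustrated with a toy Python model (a block-nested-loop flavor with per-block key deduplication; illustrative only, not a proposed design):

```python
# Toy comparison: one inner lookup per outer row, versus buffering a
# block of outer rows and doing one lookup per distinct key in the block.

def naive_nestloop(outer_rows, inner_lookup):
    """One inner-side lookup per outer row."""
    scans, out = 0, []
    for k in outer_rows:
        scans += 1
        out.append((k, inner_lookup(k)))
    return out, scans

def blocked_nestloop(outer_rows, inner_lookup, block_size=4):
    """Buffer a block of outer rows; one lookup per distinct key in it."""
    scans, out = 0, []
    for i in range(0, len(outer_rows), block_size):
        block = outer_rows[i:i + block_size]
        cache = {}
        for k in set(block):
            scans += 1
            cache[k] = inner_lookup(k)
        out.extend((k, cache[k]) for k in block)
    return out, scans

inner = {1: 'a', 2: 'b', 3: 'c'}.get
outer = [1, 1, 2, 2, 3, 3, 1, 1]
_, naive_scans = naive_nestloop(outer, inner)
_, blocked_scans = blocked_nestloop(outer, inner, block_size=4)
print(naive_scans, blocked_scans)  # 8 4
```

Whether this helps depends on duplicate keys appearing on the outer side - which, as Peter notes in his follow-up, is exactly what Floris' test case does not provide.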
Re: Optimize single tuple fetch from nbtree index
Hi Tom,

Thanks for your quick reply!

> Regardless of whether there's actually a LIMIT 1? That seems disastrous
> for every other case than the narrow one where the optimization wins.
> Because every other case is now going to have to touch the index page
> twice. That's more CPU and about double the contention --- if you could
> not measure any degradation from that, you're not measuring the right
> thing.

I thought the same as well at first. Note that I did measure a degradation of 2-3% on some cases, as mentioned, but initially I also expected worse. Do you have any ideas about which cases would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as the bottleneck. I know it is the bottleneck for the LIMIT 1 query (because these improve by a factor of 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them.

> In principle, you could pass down knowledge of whether there's a LIMIT,
> using the same mechanism used to enable top-N sorting. But it'd have to
> also go through the AM interface layer, so I'm not sure how messy this
> would be.

This was an idea I had as well, and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected already. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest in the basic idea of this patch, I will implement it.

>> This calls _bt_first in nbtsearch.c, which will, if there are scankeys,
>> descend the tree to a leaf page and read just the first (or possibly two)
>> tuples. It won't touch the rest of the page yet. If indeed just one tuple
>> was required, there won't be a call to _bt_next and we're done. If we do
>> need more than one tuple, _bt_next will resume reading tuples from the
>> index page at the point where we left off.
>
> How do you know how many index entries you have to fetch to get a tuple
> that's live/visible to the query?

Indeed we don't know that - that's why this initial patch does not make any assumptions about this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimation of this, and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next which will resume reading the page as normal. I'm open to implementing any suggestions that may improve this.

>> - We need to take into account page splits, insertions and vacuums while we
>> do not have the read-lock in between _bt_first and the first call to
>> _bt_next. This made my patch quite a bit more complicated than my initial
>> implementation.
>
> Meh. I think the odds that you got this 100% right are small, and the
> odds that it would be maintainable are smaller. There's too much that
> can happen if you're not holding any lock --- and there's a lot of active
> work on btree indexes, which could break whatever assumptions you might
> make today.

Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic about what assumptions can be made now and in the future. If it turns out that it's completely infeasible to do an optimization like this because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which run the modified bits of code quite often and in parallel. There may be plenty of edge cases left, however...

> I'm not unalterably opposed to doing something like this, but my sense
> is that the complexity and probable negative performance impact on other
> cases are not going to look like a good trade-off for optimizing the
> case at hand.

I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to make it possible to add some extra optimizations, especially if that could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to stay within a reasonable complexity. If it turns out that in order to optimize this we need to add a lot of extra complexity, it may not be worth adding.

> BTW, you haven't even really made the case that optimizing a query that
> behaves this way is the right thing to be doing ... maybe some other
> plan shape that isn't a nestloop around a LIMIT query would be a better
> solution.

It is
Re: Optimize single tuple fetch from nbtree index
Floris Van Nee writes:
> Every time the index scan is done, all tuples from the leaf page are
> read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an
> exception for this *only* the first time amgettuple gets called.

Regardless of whether there's actually a LIMIT 1? That seems disastrous
for every other case than the narrow one where the optimization wins.
Because every other case is now going to have to touch the index page
twice. That's more CPU and about double the contention --- if you could
not measure any degradation from that, you're not measuring the right
thing.

In principle, you could pass down knowledge of whether there's a LIMIT,
using the same mechanism used to enable top-N sorting. But it'd have to
also go through the AM interface layer, so I'm not sure how messy this
would be.

> This calls _bt_first in nbtsearch.c, which will, if there are scankeys,
> descend the tree to a leaf page and read just the first (or possibly
> two) tuples. It won't touch the rest of the page yet. If indeed just one
> tuple was required, there won't be a call to _bt_next and we're done. If
> we do need more than one tuple, _bt_next will resume reading tuples from
> the index page at the point where we left off.

How do you know how many index entries you have to fetch to get a tuple
that's live/visible to the query?

> - We need to take into account page splits, insertions and vacuums while
> we do not have the read-lock in between _bt_first and the first call to
> _bt_next. This made my patch quite a bit more complicated than my
> initial implementation.

Meh. I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller. There's too much that can
happen if you're not holding any lock --- and there's a lot of active work
on btree indexes, which could break whatever assumptions you might make
today.

I'm not unalterably opposed to doing something like this, but my sense is
that the complexity and probable negative performance impact on other
cases are not going to look like a good trade-off for optimizing the case
at hand.

BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other plan
shape that isn't a nestloop around a LIMIT query would be a better
solution.

			regards, tom lane
Optimize single tuple fetch from nbtree index
pgbench results for patched branch
- out_master.txt: pgbench results for master branch (can be diffed with
  out_limit.txt to efficiently see the difference)
- init_test.sql: creates a simple table for the test cases and fills it
  with data
- test_fwd.sql: the nested-loop example, with parameters :nlimit and
  :nitems to specify how many rows per inner loop to limit to and how many
  iterations of the loop need to be done. We return a sum() over a column
  to make sure transferring result data over the socket is not the
  bottleneck - this shows the worst-case behavior of this patch. When
  selecting the full row instead of the sum, the performance difference is
  negligible for the 'bad' cases, while still providing great (up to 2.5x)
  improvements for the 'good' cases. This can be tested by changing the
  sum() to select a column per row instead.
- test_fwd_eq.sql: nested loop with a simple unique equality select
- test_fwd_single.sql: single query with LIMIT, without a nested loop

-Floris

0001-Optimize-single-tuple-fetch-from-nbtree-index-v01.patch
Description: 0001-Optimize-single-tuple-fetch-from-nbtree-index-v01.patch

init_test.sql
Description: init_test.sql

+ dirname ./test_limit.sh
cd .
pwd
+++ DIR=/home/florisvannee/workspace/pg_dev/limit
+++ T=20
+++ /home/florisvannee/workspace/pg_dev/start.sh
[[ 0 -eq 2 ]]
[[ 0 -eq 1 ]]
bdir=/home/florisvannee/workspace/postgres
cd /home/florisvannee/workspace/postgres
+ git rev-parse --abbrev-ref HEAD
totaldir=limit-release
/home/florisvannee/install/limit-release/bin/pg_ctl -D/home/florisvannee/pgdata/limit-release -l /home/florisvannee/pgdata/limit-release/logfile -t 5 stop
pg_ctl: could not send stop signal (PID: 28145): No such process
pkill -9 postgres
/home/florisvannee/install/limit-release/bin/pg_ctl -D/home/florisvannee/pgdata/limit-release -l /home/florisvannee/pgdata/limit-release/logfile start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... done
server started

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql -T 20 postgres -Dnitems=1
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 156764
latency average = 0.128 ms
tps = 7838.184195 (including connections establishing)
tps = 7839.331599 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql -T 20 postgres -Dnitems=10
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 131079
latency average = 0.153 ms
tps = 6553.920794 (including connections establishing)
tps = 6554.874046 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql -T 20 postgres -Dnitems=100
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 50692
latency average = 0.395 ms
tps = 2534.595231 (including connections establishing)
tps = 2534.967502 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql -T 20 postgres -Dnitems=1000
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_eq.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 7621
latency average = 2.624 ms
tps = 381.043284 (including connections establishing)
tps = 381.093124 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_single.sql -T 20 postgres -Dnlimit=1
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_single.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 229991
latency average = 0.087 ms
tps = 11499.514293 (including connections establishing)
tps = 11501.125904 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_single.sql -T 20 postgres -Dnlimit=3
transaction type: /home/florisvannee/workspace/pg_dev/limit/test_fwd_single.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 215881
latency average = 0.093 ms
tps = 10794.011481 (including connections establishing)
tps = 10794.899742 (excluding connections establishing)

+++ pgbench -n -f /home/florisvannee/workspace/pg_dev/limit/test_fwd_single.sql -T 20 postgres -Dnlimit=50
transaction type: /home/florisvannee/wor