Re: index prefetching

2024-03-05 Thread Jakub Wartak
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra
 wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick way to do this.

... that's a pity! :( Well, then let's just finish that subthread, I
gave some explanations, but I'll try to take a look in future
revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> > + * XXX Maybe we could consider effective_cache_size when sizing the 
> > cache?
> > + * Not to size the cache for that, ofc, but maybe as a guidance of how 
> > many
> > + * heap pages it might keep. Maybe just a fraction fraction of the 
> > value,
> > + * say Max(8MB, effective_cache_size / max_connections) or something.
> > + */
> > +#define PREFETCH_LRU_SIZE    8    /* slots in one LRU */
> > +#define PREFETCH_LRU_COUNT   128  /* number of LRUs */
> > +#define PREFETCH_CACHE_SIZE  (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
> >
>
> I don't see why this would be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache. It's entirely separate from the prefetch distance.

Well, my thought was: the higher eic is, the more I/O parallelism we
are introducing, and in such a case the more past requests we need to
remember to avoid prefetching the same pages again (N * eic, where N
would be some multiplier)
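To illustrate the suggestion, a rough sketch of eic-driven sizing (the
function name, the multiplier N, and the floor are all hypothetical, not
from the patch):

```c
#include <assert.h>

#define PREFETCH_LRU_SIZE 8     /* slots in one LRU, kept static */

/*
 * Hypothetical sizing: remember roughly N * eic past requests, spread
 * across fixed-size LRUs, with a floor so the history never becomes
 * degenerately small for low eic values.
 */
static int
prefetch_lru_count(int effective_io_concurrency)
{
    const int N = 8;            /* multiplier; would need tuning */
    int slots = N * effective_io_concurrency;
    int count = (slots + PREFETCH_LRU_SIZE - 1) / PREFETCH_LRU_SIZE;

    return (count < 16) ? 16 : count;
}
```

With eic = 1 this keeps 16 LRUs (the floor), while eic = 200 grows the
history to 200 LRUs, following the user's requested I/O parallelism.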

> > 7. in IndexPrefetchComputeTarget()
> >
> > + * XXX We cap the target to plan_rows, because it's pointless to 
> > prefetch
> > + * more than we expect to use.
> >
> > That's a nice fact that's already in patch, so XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it. I'd taken all of the XXXs literally as FIXMEs
that you wanted to go away (things to be removed before the patch is
considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a different
codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95. The
final fix there got away without adding a new GUC (which is always an
option...) by introducing a sensible hard limit (fence) of 100 visited
heap pages. Here we could have a similar heuristic from the start:
if (plan_rows < we_have_already_visited_pages * avgRowsPerBlock) -->
ignore plan_rows and ramp prefetches back up to the full eic value.
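A toy sketch of that fence heuristic (all names are invented for
illustration; this is not the patch's code):

```c
#include <assert.h>

/*
 * Hypothetical fence against bad plan_rows estimates: trust the
 * estimate only while it still exceeds what the scan has already
 * produced; once reality overtakes the plan, fall back to the full
 * effective_io_concurrency target.
 */
static int
prefetch_target(double plan_rows, long visited_pages,
                double avg_rows_per_block, int eic)
{
    double rows_seen = visited_pages * avg_rows_per_block;

    if (plan_rows < rows_seen)
        return eic;             /* estimate was wrong: ramp back up */

    /* otherwise cap the target by the expected row count */
    return (plan_rows < eic) ? (int) plan_rows : eic;
}
```

So a scan estimated at 5 rows starts with a target of 5, but once it has
visited more pages than the estimate can explain, the cap is dropped.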

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> > 3000) FROM generate_series(1, 1) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> > (10*random())::int), 3000) from (select 1 * random() as r from
> > generate_series(1, 1)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX on test (id) ;
> >
>
> It's not clear to me what the purpose of this test is. Can you explain?

It's just schema preparation for the tests below:

> >
> > 2. Prefetching for the TOASTed heap seems to be not implemented at all,
> > correct? (Is my assumption correct that we should go like this:
> > t_index->t->toast_idx->toast_heap?) I'm too much of a newbie to actually
> > see the code path where it could be added - certainly it's not a blocker
> > -- but maybe the commit message could list such improvements for the
> > future?):
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming that in the above I've generated data using the cyclic / random
version and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of 

Re: index prefetching

2024-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra
 wrote:
> But I have a very hard time figuring out what the MVP version should be,
> because I have a very limited understanding of how much control the index
> AM ought to have :-( And it'd be a bit silly to do something in v17,
> only to have to rip it out in v18 because it turned out not to get the
> split right.

I suspect that you're overestimating the difficulty of getting the
layering right (at least relative to the difficulty of everything
else).

The executor proper doesn't know anything about pins on leaf pages
(and in reality nbtree usually doesn't hold any pins these days). All
the executor knows is that it had better not be possible for an
in-flight index scan to get confused by concurrent TID recycling by
VACUUM. When amgettuple/btgettuple is called, nbtree usually just
returns TIDs it collected from a just-scanned leaf page.

This sort of stuff already lives in the index AM. It seems to me that
everything at the API and executor level can continue to work in
essentially the same way as it always has, with only minimal revision
to the wording around buffer pins (in fact that really should have
happened back in 2015, as part of commit 2ed5b87f).  The hard part
will be figuring out how to make the physical index scan prefetch
optimally, in a way that balances various considerations. These
include:

* Managing heap prefetch distance.

* Avoiding making kill_prior_tuple significantly less effective
(perhaps the new design could even make it more effective, in some
scenarios, by holding onto multiple buffer pins based on a dynamic
model).

* Figuring out how many leaf pages it makes sense to read ahead of
accessing the heap, since there is no fixed relationship between the
number of leaf pages we scan and the number of distinct heap blocks
we collect for prefetching. (This is made more complicated by things
like LIMIT, but is actually an independent problem.)
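To make that last point concrete, here is a toy model (everything here
is invented for illustration, not nbtree code): it counts how many leaf
pages a scan must read before the TIDs collected so far cover enough
distinct heap blocks for a given prefetch distance.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: leaf_pages[i][j] holds the heap block of the j-th TID on
 * leaf page i (4 TIDs per page, blocks below 64, for simplicity).
 * Return how many leaf pages must be read before at least "distance"
 * distinct heap blocks have been seen.
 */
static int
leaf_pages_needed(int leaf_pages[][4], int npages, int distance)
{
    bool seen[64] = {false};
    int distinct = 0;

    for (int i = 0; i < npages; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            int blk = leaf_pages[i][j];
            if (!seen[blk])
            {
                seen[blk] = true;
                distinct++;
            }
        }
        if (distinct >= distance)
            return i + 1;       /* enough blocks after this page */
    }
    return npages;              /* scan exhausted first */
}
```

A perfectly correlated index needs one leaf page per distinct block,
while a scattered one may satisfy the whole distance from a single page
- which is exactly why the relationship is not fixed.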

So I think that you need to teach index AMs to behave roughly as if
multiple leaf pages were read as one single leaf page, at least in
terms of things like how the BTScanOpaqueData.currPos state is
managed. I imagine that currPos will need to be filled with TIDs from
multiple index pages, instead of just one, with entries that are
organized in a way that preserves the illusion of one continuous scan
from the point of view of the executor proper. By the time we actually
start really returning TIDs via btgettuple, it looks like we scanned
one giant leaf page instead of several (the exact number of leaf pages
scanned will probably have to be indeterminate, because it'll depend
on things like heap prefetch distance).

The good news (assuming that I'm right here) is that you don't need to
have specific answers to most of these questions in order to commit a
v1 of index prefetching. ISTM that all you really need is to have
confidence that the general approach that I've outlined is the right
approach, long term (certainly not nothing, but I'm at least
reasonably confident here).

> The hard thing is what to do about cases where neither of these helps.
> The example I keep thinking about is IOS - if we don't do prefetching,
> it's not hard to construct cases where regular index scan gets much
> faster than IOS (with many not-all-visible pages). But we can't just
> prefetch all pages, because that'd hurt IOS cases with most pages fully
> visible (when we don't need to actually access the heap).
>
> I managed to deal with this in the executor-level version, but I'm not
> sure how to do this if the control moves closer to the index AM.

The reality is that nbtree already knows about index-only scans. It
has to, because it wouldn't be safe to drop the pin on a leaf page's
buffer when the scan is "between pages" in the specific case of
index-only scans (so the _bt_killitems code path used when
kill_prior_tuple has index tuples to kill knows about index-only
scans).

I actually added commentary to the nbtree README that goes into TID
recycling by VACUUM not too long ago. This includes stuff about how
LP_UNUSED items in the heap are considered dead to all index scans
(which can actually try to look at a TID that just became LP_UNUSED in
the heap!), even though LP_UNUSED items don't prevent VACUUM from
setting heap pages all-visible. This seemed like the only way of
explaining the _bt_killitems IOS issue that actually made sense.

What you really want to do here is to balance costs and benefits.
That's just what's required. The fact that those costs and benefits
span multiple levels of abstractions makes it a bit awkward, but
doesn't (and can't) change the basic shape of the problem.

--
Peter Geoghegan




Re: index prefetching

2024-03-01 Thread Tomas Vondra
On 2/15/24 21:30, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 3:13 PM Andres Freund  wrote:
>>> This is why I don't think that the tuples with lower page offset
>>> numbers are in any way significant here.  The significant part is
>>> whether or not you'll actually need to visit more than one leaf page
>>> in the first place (plus the penalty from not being able to reorder
>>> the work across page boundaries in your initial v1 of prefetching).
>>
>> To me your phrasing just seems to reformulate the issue.
> 
> What I said to Tomas seems very obvious to me. I think that there
> might have been some kind of miscommunication (not a real
> disagreement). I was just trying to work through that.
> 
>> In practical terms you'll have to wait for the full IO latency when fetching
>> the table tuple corresponding to the first tid on a leaf page. Of course
>> that's also the moment you had to visit another leaf page. Whether the stall
>> is due to visiting another leaf page or due to processing the first entry on 
>> such
>> a leaf page is a distinction without a difference.
> 
> I don't think anybody said otherwise?
> 
>>>> That's certainly true / helpful, and it makes the "first entry" issue
>>>> much less common. But the issue is still there. Of course, this says
>>>> nothing about the importance of the issue - the impact may easily be so
>>>> small it's not worth worrying about.
>>>
>>> Right. And I want to be clear: I'm really *not* sure how much it
>>> matters. I just doubt that it's worth worrying about in v1 -- time
>>> grows short. Although I agree that we should commit a v1 that leaves
>>> the door open to improving matters in this area in v2.
>>
>> I somewhat doubt that it's realistic to aim for 17 at this point.
> 
> That's a fair point. Tomas?
> 

I think that's a fair assessment.

To me it seems doing the prefetching solely at the executor level is not
really workable. And even if it can be made to work, there are far too
many open questions to resolve in the last commitfest.

I think the consensus is at least some of the logic/control needs to
move back to the index AM. Maybe there's some minimal part that we could
do for v17, even if it has various limitations, and then improve that in
v18. Say, doing the leaf-page-at-a-time and passing a little bit of
information from the index scan to drive this.

But I have a very hard time figuring out what the MVP version should be,
because I have a very limited understanding of how much control the index
AM ought to have :-( And it'd be a bit silly to do something in v17,
only to have to rip it out in v18 because it turned out not to get the
split right.

>> We seem to
>> still be doing fairly fundamental architectural work. I think it might be the
>> right thing even for 18 to go for the simpler only-a-single-leaf-page
>> approach though.
> 
> I definitely think it's a good idea to have that as a fall back
> option. And to not commit ourselves to having something better than
> that for v1 (though we probably should commit to making that possible
> in v2).
> 

Yeah, I agree with that.

>> I wonder if there are prerequisites that can be tackled for 17. One idea is 
>> to
>> work on infrastructure to provide executor nodes with information about the
>> number of tuples likely to be fetched - I suspect we'll trigger regressions
>> without that in place.
> 
> I don't think that there'll be regressions if we just take the simpler
> only-a-single-leaf-page approach. At least it seems much less likely.
> 

I'm sure we could pass additional information from the index scans to
improve that further. But I think the gradual ramp-up would deal with
most regressions. At least that's my experience from benchmarking the
early version.
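For reference, a gradual ramp-up of the prefetch distance can be
sketched as follows (a toy model, not the patch's actual code): start at
1 and grow toward the target, so short scans never issue many
prefetches, which is what limits regressions on selective queries.

```c
#include <assert.h>

/*
 * Sketch of a prefetch-distance ramp-up: double the distance on each
 * step until the target (e.g. effective_io_concurrency) is reached.
 */
static int
ramp_up(int current_distance, int target)
{
    if (current_distance == 0)
        return 1;               /* first prefetch: distance 1 */
    if (current_distance * 2 >= target)
        return target;          /* clamp at the target */
    return current_distance * 2;
}
```

This is the same general shape bitmap heap scans use for their prefetch
distance, which is why short/selective scans see little extra I/O.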

The hard thing is what to do about cases where neither of these helps.
The example I keep thinking about is IOS - if we don't do prefetching,
it's not hard to construct cases where regular index scan gets much
faster than IOS (with many not-all-visible pages). But we can't just
prefetch all pages, because that'd hurt IOS cases with most pages fully
visible (when we don't need to actually access the heap).

I managed to deal with this in the executor-level version, but I'm not
sure how to do this if the control moves closer to the index AM.

>> One way to *sometimes* process more than a single leaf page, without having 
>> to
>> redesign kill_prior_tuple, would be to use the visibilitymap to check if the
>> target pages are all-visible. If all the table pages on a leaf page are
>> all-visible, we know that we don't need to kill index entries, and thus can
>> move on to the next leaf page
> 
> It's possible that we'll need a variety of different strategies.
> nbtree already has two such strategies in _bt_killitems(), in a way.
> Though its "Modified while not pinned means hinting is not safe" path
> (LSN doesn't match canary value path) seems pretty naive. The
> prefetching stuff might present us with a good opportunity to replace
> that with something 

Re: index prefetching

2024-03-01 Thread Tomas Vondra
Hi,

Thanks for looking at the patch!


On 3/1/24 09:20, Jakub Wartak wrote:
> On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
>  wrote:
> [
>>
>> (1) Melanie actually presented a very different way to implement this,
>> relying on the StreamingRead API. So chances are this struct won't
>> actually be used.
> 
> Given lots of effort already spent on this and the fact that this thread
> is actually two:
> 
> a. index/table prefetching since Jun 2023 till ~Jan 2024
> b. afterwards index/table prefetching with the Streaming API, but there
> are some doubts whether it could happen for v17 [1]
> 
> ... it would be a pity not to take benefit of such work (even if the
> Streaming API wouldn't be ready for this; although there's lots of
> movement in the area), so I've played a little with the earlier
> implementation from [2] without the streaming API, as it already received
> feedback, it demonstrated big benefits, and it earlier got attention
> at the pgcon unconference. Perhaps some of those comments might be passed
> later to the "b"-patch (once that's feasible):
> 

TBH I don't have a clear idea what to do. It'd be cool to have at least
some benefits in v17, but I don't know how to do that in a way that
would be useful in the future.

For example, the v20240124 patch implements this in the executor, but
based on the recent discussions it seems that's not the right layer -
the index AM needs to have some control, and I'm not convinced it's
possible to improve it in that direction (even ignoring the various
issues we identified in the executor-based approach).

I think it might be more practical to do this from the index AM, even if
it has various limitations. Ironically, that's what I proposed at pgcon,
but mostly because it was the quick way to do this.

> 1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
> not apply cleanly anymore, due show_buffer_usage() being quite
> recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 :
> 
> patching file src/backend/commands/explain.c
> Hunk #1 FAILED at 3568.
> Hunk #2 FAILED at 3679.
> 2 out of 2 hunks FAILED -- saving rejects to file
> src/backend/commands/explain.c.rej
> 
> 2. v2 applies (fixup), but it would be nice to see that (it adds
> IndexOnlyPrefetchInfo) integrated into the main patch
> 

Yeah, but I think it was an old patch version, no point in rebasing that
forever. Also, I'm not really convinced the executor-level approach is
the right path forward.

> 3. execMain.c :
> 
> + * XXX It might be possible to improve the prefetching code
> to handle this
> + * by "walking back" the TID queue, but it's not clear if
> it's worth it.
> 
> Shouldn't we just remove the XXX? The walking-back seems to be a niche
> case, as are fetches using cursors, when looking at real-world user
> queries (support-case bias here, from looking at people's pg_stat_activity)
> 
> 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> (allowing it to follow dynamically; the more prefetches the user wants
> to perform, the more you spread them across shared LRUs and the more
> memory for history is required?)
> 
> + * XXX Maybe we could consider effective_cache_size when sizing the 
> cache?
> + * Not to size the cache for that, ofc, but maybe as a guidance of how 
> many
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> + * say Max(8MB, effective_cache_size / max_connections) or something.
> + */
> +#define PREFETCH_LRU_SIZE    8    /* slots in one LRU */
> +#define PREFETCH_LRU_COUNT   128  /* number of LRUs */
> +#define PREFETCH_CACHE_SIZE  (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
> 

I don't see why this would be related to effective_io_concurrency? It's
merely about how many recently accessed pages we expect to find in the
page cache. It's entirely separate from the prefetch distance.

> BTW:
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> that's a duplicated "fraction" word over there.
> 
> 5.
> + * XXX Could it be harmful that we read the queue backwards?
> Maybe memory
> + * prefetching works better for the forward direction?
> 
> I wouldn't care; we are optimizing I/O (and context-switching), which
> weighs much more than the memory access direction impact, and Dilip
> earlier also expressed no concern, so maybe it could also be removed
> (one less "XXX" to care about)
> 

Yeah, I think it's negligible. Probably a microoptimization we can
investigate later, I don't want to complicate the code unnecessarily.

> 6. in IndexPrefetchFillQueue()
> 
> +while (!PREFETCH_QUEUE_FULL(prefetch))
> +{
> +IndexPrefetchEntry *entry
> += prefetch->next_cb(scan, direction, prefetch->data);
> 
> If we are at it... that's a strange split and assignment not indented :^)
> 
> 7. in 

Re: index prefetching

2024-03-01 Thread Jakub Wartak
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
 wrote:
[
>
> (1) Melanie actually presented a very different way to implement this,
> relying on the StreamingRead API. So chances are this struct won't
> actually be used.

Given lots of effort already spent on this and the fact that this thread
is actually two:

a. index/table prefetching since Jun 2023 till ~Jan 2024
b. afterwards index/table prefetching with the Streaming API, but there
are some doubts whether it could happen for v17 [1]

... it would be a pity not to take benefit of such work (even if the
Streaming API wouldn't be ready for this; although there's lots of
movement in the area), so I've played a little with the earlier
implementation from [2] without the streaming API, as it already received
feedback, it demonstrated big benefits, and it earlier got attention
at the pgcon unconference. Perhaps some of those comments might be passed
later to the "b"-patch (once that's feasible):

1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
not apply cleanly anymore, due show_buffer_usage() being quite
recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 :

patching file src/backend/commands/explain.c
Hunk #1 FAILED at 3568.
Hunk #2 FAILED at 3679.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/commands/explain.c.rej

2. v2 applies (fixup), but it would be nice to see that (it adds
IndexOnlyPrefetchInfo) integrated into the main patch

3. execMain.c :

+ * XXX It might be possible to improve the prefetching code
to handle this
+ * by "walking back" the TID queue, but it's not clear if
it's worth it.

Shouldn't we just remove the XXX? The walking-back seems to be a niche
case, as are fetches using cursors, when looking at real-world user
queries (support-case bias here, from looking at people's pg_stat_activity)

4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
(allowing it to follow dynamically; the more prefetches the user wants
to perform, the more you spread them across shared LRUs and the more
memory for history is required?)

+ * XXX Maybe we could consider effective_cache_size when sizing the cache?
+ * Not to size the cache for that, ofc, but maybe as a guidance of how many
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
+ * say Max(8MB, effective_cache_size / max_connections) or something.
+ */
+#define PREFETCH_LRU_SIZE    8    /* slots in one LRU */
+#define PREFETCH_LRU_COUNT   128  /* number of LRUs */
+#define PREFETCH_CACHE_SIZE  (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)

BTW:
+ * heap pages it might keep. Maybe just a fraction fraction of the value,
that's a duplicated "fraction" word over there.
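For context, the cache these defines size spreads recently prefetched
block numbers across the LRUs; a toy sketch of how a block could be
mapped to one of the 128 LRUs (the hash and function name are invented
for illustration, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH_LRU_SIZE   8       /* slots in one LRU */
#define PREFETCH_LRU_COUNT  128     /* number of LRUs */

/* Map a heap block number to the LRU responsible for caching it. */
static int
prefetch_lru_index(uint32_t block)
{
    /* any cheap mixing hash would do; Knuth's multiplicative hash here */
    uint32_t h = block * 2654435761u;

    return (int) (h % PREFETCH_LRU_COUNT);
}
```

Each lookup then only has to scan the PREFETCH_LRU_SIZE slots of one
LRU, keeping the "have we prefetched this recently?" check cheap.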

5.
+ * XXX Could it be harmful that we read the queue backwards?
Maybe memory
+ * prefetching works better for the forward direction?

I wouldn't care; we are optimizing I/O (and context-switching), which
weighs much more than the memory access direction impact, and Dilip
earlier also expressed no concern, so maybe it could also be removed
(one less "XXX" to care about)

6. in IndexPrefetchFillQueue()

+while (!PREFETCH_QUEUE_FULL(prefetch))
+{
+IndexPrefetchEntry *entry
+= prefetch->next_cb(scan, direction, prefetch->data);

If we are at it... that's a strange split and assignment not indented :^)

7. in IndexPrefetchComputeTarget()

+ * XXX We cap the target to plan_rows, because it's pointless to prefetch
+ * more than we expect to use.

That's a nice fact that's already in patch, so XXX isn't needed?

8.
+ * XXX Maybe we should reduce the value with parallel workers?

I was assuming it could be a good idea, but the same
(eic / actual_parallel_workers_per_gather) doesn't seem to be performed
for bitmap heap scan prefetches, so no?

9.
+/*
+ * No prefetching for direct I/O.
+ *
+ * XXX Shouldn't we do prefetching even for direct I/O? We would only
+ * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
+ * once the code starts loading into shared buffers, that'd work.
+ */
+if ((io_direct_flags & IO_DIRECT_DATA) != 0)
+return 0;

It's redundant (?) and could be removed as
PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571:

#ifdef USE_PREFETCH
        /*
         * Try to initiate an asynchronous read.  This returns false in
         * recovery if the relation file doesn't exist.
         */
        if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
            smgrprefetch(smgr_reln, forkNum, blockNum, 1))
        {
            result.initiated_io = true;
        }
#endif

Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 3:13 PM Andres Freund  wrote:
> > This is why I don't think that the tuples with lower page offset
> > numbers are in any way significant here.  The significant part is
> > whether or not you'll actually need to visit more than one leaf page
> > in the first place (plus the penalty from not being able to reorder
> > the work across page boundaries in your initial v1 of prefetching).
>
> To me your phrasing just seems to reformulate the issue.

What I said to Tomas seems very obvious to me. I think that there
might have been some kind of miscommunication (not a real
disagreement). I was just trying to work through that.

> In practical terms you'll have to wait for the full IO latency when fetching
> the table tuple corresponding to the first tid on a leaf page. Of course
> that's also the moment you had to visit another leaf page. Whether the stall
> is due to visiting another leaf page or due to processing the first entry on such
> a leaf page is a distinction without a difference.

I don't think anybody said otherwise?

> > > That's certainly true / helpful, and it makes the "first entry" issue
> > > much less common. But the issue is still there. Of course, this says
> > > nothing about the importance of the issue - the impact may easily be so
> > > small it's not worth worrying about.
> >
> > Right. And I want to be clear: I'm really *not* sure how much it
> > matters. I just doubt that it's worth worrying about in v1 -- time
> > grows short. Although I agree that we should commit a v1 that leaves
> > the door open to improving matters in this area in v2.
>
> I somewhat doubt that it's realistic to aim for 17 at this point.

That's a fair point. Tomas?

> We seem to
> still be doing fairly fundamental architectural work. I think it might be the
> right thing even for 18 to go for the simpler only-a-single-leaf-page
> approach though.

I definitely think it's a good idea to have that as a fall back
option. And to not commit ourselves to having something better than
that for v1 (though we probably should commit to making that possible
in v2).

> I wonder if there are prerequisites that can be tackled for 17. One idea is to
> work on infrastructure to provide executor nodes with information about the
> number of tuples likely to be fetched - I suspect we'll trigger regressions
> without that in place.

I don't think that there'll be regressions if we just take the simpler
only-a-single-leaf-page approach. At least it seems much less likely.

> One way to *sometimes* process more than a single leaf page, without having to
> redesign kill_prior_tuple, would be to use the visibilitymap to check if the
> target pages are all-visible. If all the table pages on a leaf page are
> all-visible, we know that we don't need to kill index entries, and thus can
> move on to the next leaf page

It's possible that we'll need a variety of different strategies.
nbtree already has two such strategies in _bt_killitems(), in a way.
Though its "Modified while not pinned means hinting is not safe" path
(LSN doesn't match canary value path) seems pretty naive. The
prefetching stuff might present us with a good opportunity to replace
that with something fundamentally better.

-- 
Peter Geoghegan




Re: index prefetching

2024-02-15 Thread Andres Freund
Hi,

On 2024-02-15 12:53:10 -0500, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra
>  wrote:
> > I may be missing something, but it seems fairly self-evident to me an
> > entry at the beginning of an index page won't get prefetched (assuming
> > the page-at-a-time thing).
> 
> Sure, if the first item on the page is also the first item that we
> need the scan to return (having just descended the tree), then it
> won't get prefetched under a scheme that sticks with the current
> page-at-a-time behavior (at least in v1). Just like when the first
> item that we need the scan to return is from the middle of the page,
> or more towards the end of the page.
> 
> It is of course also true that we can't prefetch the next page's
> first item until we actually visit the next page -- clearly that's
> suboptimal. Just like we can't prefetch any other, later tuples from
> the next page (until such time as we have determined for sure that
> there really will be a next page, and have called _bt_readpage for
> that next page.)
>
> This is why I don't think that the tuples with lower page offset
> numbers are in any way significant here.  The significant part is
> whether or not you'll actually need to visit more than one leaf page
> in the first place (plus the penalty from not being able to reorder
> the work across page boundaries in your initial v1 of prefetching).

To me your phrasing just seems to reformulate the issue.

In practical terms you'll have to wait for the full IO latency when fetching
the table tuple corresponding to the first tid on a leaf page. Of course
that's also the moment you had to visit another leaf page. Whether the stall
is due to visiting another leaf page or due to processing the first entry on such
a leaf page is a distinction without a difference.


> > That's certainly true / helpful, and it makes the "first entry" issue
> > much less common. But the issue is still there. Of course, this says
> > nothing about the importance of the issue - the impact may easily be so
> > small it's not worth worrying about.
> 
> Right. And I want to be clear: I'm really *not* sure how much it
> matters. I just doubt that it's worth worrying about in v1 -- time
> grows short. Although I agree that we should commit a v1 that leaves
> the door open to improving matters in this area in v2.

I somewhat doubt that it's realistic to aim for 17 at this point. We seem to
still be doing fairly fundamental architectural work. I think it might be the
right thing even for 18 to go for the simpler only-a-single-leaf-page
approach though.

I wonder if there are prerequisites that can be tackled for 17. One idea is to
work on infrastructure to provide executor nodes with information about the
number of tuples likely to be fetched - I suspect we'll trigger regressions
without that in place.



One way to *sometimes* process more than a single leaf page, without having to
redesign kill_prior_tuple, would be to use the visibilitymap to check if the
target pages are all-visible. If all the table pages on a leaf page are
all-visible, we know that we don't need to kill index entries, and thus can
move on to the next leaf page
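A rough sketch of that check, with a plain bool array standing in for
the visibility map (the real code would go through VM_ALL_VISIBLE();
all names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether we may advance past a leaf page without keeping it
 * around for kill_prior_tuple: if every heap block the page's TIDs
 * point at is all-visible, no index entries can need killing.  "vm"
 * is a toy stand-in for the visibility map, indexed by heap block.
 */
static bool
leaf_page_skippable(const uint32_t *heap_blocks, int nblocks,
                    const bool *vm)
{
    for (int i = 0; i < nblocks; i++)
    {
        if (!vm[heap_blocks[i]])
            return false;       /* some referenced block not all-visible */
    }
    return true;
}
```

If the check fails, the scan would fall back to the single-leaf-page
behavior for that page, so kill_prior_tuple keeps working there.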

Greetings,

Andres Freund




Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra
 wrote:
> I may be missing something, but it seems fairly self-evident to me an
> entry at the beginning of an index page won't get prefetched (assuming
> the page-at-a-time thing).

Sure, if the first item on the page is also the first item that we
need the scan to return (having just descended the tree), then it
won't get prefetched under a scheme that sticks with the current
page-at-a-time behavior (at least in v1). Just like when the first
item that we need the scan to return is from the middle of the page,
or more towards the end of the page.

It is of course also true that we can't prefetch the next page's
first item until we actually visit the next page -- clearly that's
suboptimal. Just like we can't prefetch any other, later tuples from
the next page (until such time as we have determined for sure that
there really will be a next page, and have called _bt_readpage for
that next page.)

This is why I don't think that the tuples with lower page offset
numbers are in any way significant here.  The significant part is
whether or not you'll actually need to visit more than one leaf page
in the first place (plus the penalty from not being able to reorder
the work across page boundaries in your initial v1 of prefetching).

> If I understand your point about boundary cases / suffix truncation,
> that helps us by (a) picking the split in a way to minimize a single key
> spanning multiple pages, if possible and (b) increasing the number of
> entries that fit onto a single index page.

More like it makes the boundaries between leaf pages (i.e. high keys)
align with the "natural boundaries of the key space". Simple point
queries should practically never require more than a single leaf page
access as a result. Even somewhat complicated index scans that are
reasonably selective (think tens to low hundreds of matches) don't
tend to need to read more than a single leaf page match, at least with
equality type scan keys for the index qual.

> That's certainly true / helpful, and it makes the "first entry" issue
> much less common. But the issue is still there. Of course, this says
> nothing about the importance of the issue - the impact may easily be so
> small it's not worth worrying about.

Right. And I want to be clear: I'm really *not* sure how much it
matters. I just doubt that it's worth worrying about in v1 -- time
grows short. Although I agree that we should commit a v1 that leaves
the door open to improving matters in this area in v2.

> One case I've been thinking about is sorting using index, where we often
> read large part of the index.

That definitely seems like a case where reordering
work/desynchronization of the heap and index scans might be relatively
important.

> > I think that there is no question that this will need to not
> > completely disable kill_prior_tuple -- I'd be surprised if one single
> > person disagreed with me on this point. There is also a more nuanced
> > way of describing this same restriction, but we don't necessarily need
> > to agree on what exactly that is right now.
> >
>
> Even for the page-at-a-time approach? Or are you talking about the v2?

I meant that the current kill_prior_tuple behavior isn't sacred, and
can be revised in v2, for the benefit of lifting the restriction on
prefetching. But that's going to involve a trade-off of some kind. And
not a particularly simple one.

> Yeah. The basic idea was that by moving this above index AM it will work
> for all indexes automatically - but given the current discussion about
> kill_prior_tuple, locking etc. I'm not sure that's really feasible.
>
> The index AM clearly needs to have more control over this.

Cool. I think that that makes the layering question a lot clearer, then.


--
Peter Geoghegan




Re: index prefetching

2024-02-15 Thread Tomas Vondra



On 2/15/24 17:42, Peter Geoghegan wrote:
> On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra
>  wrote:
>> On 2/15/24 00:06, Peter Geoghegan wrote:
>>> I suppose that it might be much more important than I imagine it is
>>> right now, but it'd be nice to have something a bit more concrete to
>>> go on.
>>>
>>
>> This probably depends on which corner cases are considered important.
>>
>> The page-at-a-time approach essentially means index items at the
>> beginning of the page won't get prefetched (or vice versa, prefetch
>> distance drops to 0 when we get to end of index page).
> 
> I don't think that's true. At least not for nbtree scans.
> 
> As I went into last year, you'd get the benefit of the work I've done
> on "boundary cases" (most recently in commit c9c0589f from just a
> couple of months back), which helps us get the most out of suffix
> truncation. This maximizes the chances of only having to scan a single
> index leaf page in many important cases. So I can see no reason why
> index items at the beginning of the page are at any particular
> disadvantage (compared to those from the middle or the end of the
> page).
> 

I may be missing something, but it seems fairly self-evident to me that
an entry at the beginning of an index page won't get prefetched (assuming
the page-at-a-time thing).

If I understand your point about boundary cases / suffix truncation,
that helps us by (a) picking the split in a way to minimize a single key
spanning multiple pages, if possible and (b) increasing the number of
entries that fit onto a single index page.

That's certainly true / helpful, and it makes the "first entry" issue
much less common. But the issue is still there. Of course, this says
nothing about the importance of the issue - the impact may easily be so
small it's not worth worrying about.

> Where you might have a problem is cases where it's just inherently
> necessary to visit more than a single leaf page, despite the best
> efforts of the nbtsplitloc.c logic -- cases where the scan just
> inherently needs to return tuples that "straddle the boundary between
> two neighboring pages". That isn't a particularly natural restriction,
> but it's also not obvious that it's all that much of a disadvantage in
> practice.
> 

One case I've been thinking about is sorting using index, where we often
read large part of the index.

>> It certainly was a great improvement, no doubt about that. I dislike the
>> restriction, but that's partially for aesthetic reasons - it just seems
>> it'd be nice to not have this.
>>
>> That being said, I'd be OK with having this restriction if it makes v1
>> feasible. For me, the big question is whether it'd mean we're stuck with
>> this restriction forever, or whether there's a viable way to improve
>> this in v2.
> 
> I think that there is no question that this will need to not
> completely disable kill_prior_tuple -- I'd be surprised if one single
> person disagreed with me on this point. There is also a more nuanced
> way of describing this same restriction, but we don't necessarily need
> to agree on what exactly that is right now.
> 

Even for the page-at-a-time approach? Or are you talking about the v2?

>> And I don't have answer to that :-( I got completely lost in the ongoing
>> discussion about the locking implications (which I happily ignored while
>> working on the PoC patch), layering tensions and questions which part
>> should be "in control".
> 
> Honestly, I always thought that it made sense to do things on the
> index AM side. When you went the other way I was surprised. Perhaps I
> should have said more about that, sooner, but I'd already said quite a
> bit at that point, so...
> 
> Anyway, I think that it's pretty clear that "naive desynchronization"
> is just not acceptable, because that'll disable kill_prior_tuple
> altogether. So you're going to have to do this in a way that more or
> less preserves something like the current kill_prior_tuple behavior.
> It's going to have some downsides, but those can be managed. They can
> be managed from within the index AM itself, a bit like the
> _bt_killitems() no-pin stuff does things already.
> 
> Obviously this interpretation suggests that doing things at the index
> AM level is indeed the right way to go, layering-wise. Does it make
> sense to you, though?
> 

Yeah. The basic idea was that by moving this above index AM it will work
for all indexes automatically - but given the current discussion about
kill_prior_tuple, locking etc. I'm not sure that's really feasible.

The index AM clearly needs to have more control over this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-15 Thread Peter Geoghegan
On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra
 wrote:
> On 2/15/24 00:06, Peter Geoghegan wrote:
> > I suppose that it might be much more important than I imagine it is
> > right now, but it'd be nice to have something a bit more concrete to
> > go on.
> >
>
> This probably depends on which corner cases are considered important.
>
> The page-at-a-time approach essentially means index items at the
> beginning of the page won't get prefetched (or vice versa, prefetch
> distance drops to 0 when we get to end of index page).

I don't think that's true. At least not for nbtree scans.

As I went into last year, you'd get the benefit of the work I've done
on "boundary cases" (most recently in commit c9c0589f from just a
couple of months back), which helps us get the most out of suffix
truncation. This maximizes the chances of only having to scan a single
index leaf page in many important cases. So I can see no reason why
index items at the beginning of the page are at any particular
disadvantage (compared to those from the middle or the end of the
page).

Where you might have a problem is cases where it's just inherently
necessary to visit more than a single leaf page, despite the best
efforts of the nbtsplitloc.c logic -- cases where the scan just
inherently needs to return tuples that "straddle the boundary between
two neighboring pages". That isn't a particularly natural restriction,
but it's also not obvious that it's all that much of a disadvantage in
practice.

> It certainly was a great improvement, no doubt about that. I dislike the
> restriction, but that's partially for aesthetic reasons - it just seems
> it'd be nice to not have this.
>
> That being said, I'd be OK with having this restriction if it makes v1
> feasible. For me, the big question is whether it'd mean we're stuck with
> this restriction forever, or whether there's a viable way to improve
> this in v2.

I think that there is no question that this will need to not
completely disable kill_prior_tuple -- I'd be surprised if one single
person disagreed with me on this point. There is also a more nuanced
way of describing this same restriction, but we don't necessarily need
to agree on what exactly that is right now.

> And I don't have answer to that :-( I got completely lost in the ongoing
> discussion about the locking implications (which I happily ignored while
> working on the PoC patch), layering tensions and questions which part
> should be "in control".

Honestly, I always thought that it made sense to do things on the
index AM side. When you went the other way I was surprised. Perhaps I
should have said more about that, sooner, but I'd already said quite a
bit at that point, so...

Anyway, I think that it's pretty clear that "naive desynchronization"
is just not acceptable, because that'll disable kill_prior_tuple
altogether. So you're going to have to do this in a way that more or
less preserves something like the current kill_prior_tuple behavior.
It's going to have some downsides, but those can be managed. They can
be managed from within the index AM itself, a bit like the
_bt_killitems() no-pin stuff does things already.

Obviously this interpretation suggests that doing things at the index
AM level is indeed the right way to go, layering-wise. Does it make
sense to you, though?

-- 
Peter Geoghegan




Re: index prefetching

2024-02-15 Thread Tomas Vondra
On 2/15/24 00:06, Peter Geoghegan wrote:
> On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
>  wrote:
>
>> ...
> 
> 2. Are you sure that the leaf-page-at-a-time thing is such a huge
> hindrance to effective prefetching?
> 
> I suppose that it might be much more important than I imagine it is
> right now, but it'd be nice to have something a bit more concrete to
> go on.
> 

This probably depends on which corner cases are considered important.

The page-at-a-time approach essentially means index items at the
beginning of the page won't get prefetched (or vice versa, prefetch
distance drops to 0 when we get to end of index page).

That may be acceptable, considering we can usually fit 200+ index items
on a single page. Even then it limits which effective_io_concurrency
values are sensible; in my experience the benefits quickly diminish past ~32.
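The clamp being described can be made concrete with a toy distance computation (illustrative only; the function and parameter names are not from the patch):

```python
def effective_prefetch_distance(current_offset, items_on_page, target_distance):
    """Page-at-a-time prefetching can only look ahead at items on the
    current leaf page, so the usable prefetch distance shrinks to zero
    as the scan approaches the end of the page, regardless of how large
    effective_io_concurrency is."""
    remaining = items_on_page - current_offset - 1
    return min(target_distance, max(remaining, 0))
```

With 200 items per page and a target distance of 32, the scan runs at full distance for most of the page, but the last ~32 items get progressively less lookahead, which is the corner case under discussion.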


> 3. Even if it is somewhat important, do you really need to get that
> part working in v1?
> 
> Tomas' original prototype worked with the leaf-page-at-a-time thing,
> and that still seemed like a big improvement to me. While being less
> invasive, in effect. If we can agree that something like that
> represents a useful step in the right direction (not an evolutionary
> dead end), then we can make good incremental progress within a single
> release.
> 

It certainly was a great improvement, no doubt about that. I dislike the
restriction, but that's partially for aesthetic reasons - it just seems
it'd be nice to not have this.

That being said, I'd be OK with having this restriction if it makes v1
feasible. For me, the big question is whether it'd mean we're stuck with
this restriction forever, or whether there's a viable way to improve
this in v2.

And I don't have an answer to that :-( I got completely lost in the ongoing
discussion about the locking implications (which I happily ignored while
working on the PoC patch), layering tensions and questions which part
should be "in control".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-14 Thread Robert Haas
On Thu, Feb 15, 2024 at 10:33 AM Andres Freund  wrote:
> The issue here is that we need to read index leaf pages (synchronously for
> now!) to get the tids to do readahead of table data. What you describe is done
> for the table data (IMO not a good idea medium term [1]), but the problem at
> hand is that once we've done readahead for all the tids on one index page, we
> can't do more readahead without looking at the next index leaf page.

Oh, right.

> However, if we want to issue table readahead for tids on the neighboring index
> leaf page, we'll - as the patch stands - not hold a pin on the "current" index
> leaf page. Which makes index prefetching as currently implemented incompatible
> with kill_prior_tuple, as that requires the index leaf page pin being held.

But I think it probably also breaks MVCC, as Peter was saying.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-15 09:59:27 +0530, Robert Haas wrote:
> I would have thought that the way this prefetching would work is that
> we would bring pages into shared_buffers sooner than we currently do,
> but not actually pin them until we're ready to use them, so that it's
> possible they might be evicted again before we get around to them, if
> we prefetch too far and the system is too busy.

The issue here is that we need to read index leaf pages (synchronously for
now!) to get the tids to do readahead of table data. What you describe is done
for the table data (IMO not a good idea medium term [1]), but the problem at
hand is that once we've done readahead for all the tids on one index page, we
can't do more readahead without looking at the next index leaf page.

Obviously that would lead to a sawtooth like IO pattern, where you'd regularly
have to wait for IO for the first tuples referenced by an index leaf page.
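The sawtooth can be illustrated with a toy model of the number of heap reads kept in flight over the course of a scan (purely illustrative; the names are made up):

```python
def simulate_queue_depth(pages, tids_per_page, distance):
    """Toy model of the sawtooth IO pattern: within one leaf page we can
    keep up to `distance` heap reads in flight, but the in-flight count
    collapses at every leaf-page boundary, because the next page's TIDs
    aren't known until the (synchronous) leaf page read completes."""
    depths = []
    for _ in range(pages):
        for i in range(tids_per_page):
            # lookahead is limited to TIDs still remaining on this page
            depths.append(min(distance, tids_per_page - i - 1))
        # boundary: depth has fallen to zero; we stall on the leaf read
    return depths
```

Plotting the returned depths gives the regular ramp-down-to-zero shape: the first heap fetches after each leaf page boundary have had no chance to be prefetched.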

However, if we want to issue table readahead for tids on the neighboring index
leaf page, we'll - as the patch stands - not hold a pin on the "current" index
leaf page. Which makes index prefetching as currently implemented incompatible
with kill_prior_tuple, as that requires the index leaf page pin being held.


> Alternately, it also seems OK to read those later pages and pin them right
> away, as long as (1) we don't also give up pins that we would have held in
> the absence of prefetching and (2) we have some mechanism for limiting the
> number of extra pins that we're holding to a reasonable number given the
> size of shared_buffers.

FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently
used to limit how many buffers a backend may pin for bulk relation extension.

Greetings,

Andres Freund


[1] The main reasons that I think that just doing readahead without keeping a
pin is a bad idea, at least medium term, are:

a) To do AIO you need to hold a pin on the page while the IO is in progress,
as the target buffer contents will be modified at some moment you don't
control, so that buffer should better not be replaced while IO is in
progress. So at the very least you need to hold a pin until the IO is over.

b) If you do not keep a pin until you actually use the page, you need to
either do another buffer lookup (expensive!) or you need to remember the
buffer id and revalidate that it's still pointing to the same block (cheaper,
but still not cheap).  That's not just bad because it's slow in an absolute
sense, more importantly it increases the potential performance downside of
doing readahead for fully cached workloads, because you don't gain anything,
but pay the price of two lookups/revalidation.

Note that these reasons really just apply to cases where we read ahead because
we are quite certain we'll need exactly those blocks (leaving errors or
queries ending early aside), not for "heuristic" prefetching. If we e.g. were
to issue prefetch requests for neighboring index pages while descending during
an ordered index scan, without checking that we'll need those, it'd make sense
to just do a "throwaway" prefetch request.
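Point (b) above, remembering the buffer id and cheaply revalidating it, could be sketched like this (a sketch only; `Buffer` and `revalidate` are hypothetical stand-ins for the buffer-tag check, not PostgreSQL's actual bufmgr API):

```python
class Buffer:
    """Minimal stand-in for a shared-buffer slot."""
    def __init__(self, buf_id, tag):
        self.buf_id = buf_id
        self.tag = tag  # which disk block this slot currently holds

def revalidate(buffers, buf_id, expected_block):
    """Instead of a full (expensive) hash-table lookup, keep the buffer id
    from prefetch time and re-check that the slot still holds the expected
    block; without a pin it may have been evicted and reused meanwhile."""
    return buffers[buf_id].tag == expected_block
```

Even this cheaper check is pure overhead for fully cached workloads, which is the performance-downside argument for holding the pin instead.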




Re: index prefetching

2024-02-14 Thread Robert Haas
On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra
 wrote:
> I don't think it's just a bookkeeping problem. In a way, nbtree already
> does keep an array of tuples to kill (see btgettuple), but it's always
> for the current index page. So it's not that we immediately go and kill
> the prior tuple - nbtree already stashes it in an array, and kills all
> those tuples when moving to the next index page.
>
> The way I understand the problem is that with prefetching we're bound to
> determine the kill_prior_tuple flag with a delay, in which case we might
> have already moved to the next index page ...

Well... I'm not clear on all of the details of how this works, but
this sounds broken to me, for the reasons that Peter G. mentions in
his comments about desynchronization. If we currently have a rule that
you hold a pin on the index page while processing the heap tuples it
references, you can't just throw that out the window and expect things
to keep working. Saying that kill_prior_tuple doesn't work when you
throw that rule out the window is probably understating the extent of
the problem very considerably.

I would have thought that the way this prefetching would work is that
we would bring pages into shared_buffers sooner than we currently do,
but not actually pin them until we're ready to use them, so that it's
possible they might be evicted again before we get around to them, if
we prefetch too far and the system is too busy. Alternately, it also
seems OK to read those later pages and pin them right away, as long as
(1) we don't also give up pins that we would have held in the absence
of prefetching and (2) we have some mechanism for limiting the number
of extra pins that we're holding to a reasonable number given the size
of shared_buffers.

However, it doesn't seem OK at all to give up pins that the current
code holds sooner than the current code would do.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund  wrote:
> On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrently TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
>
> Given that the interlock is only needed for non-mvcc scans, that non-mvcc
> scans are rare due to catalog accesses using snapshots these days and that
> most non-mvcc scans do single-tuple lookups, it might be viable to be more
> restrictive about prefetching iff non-mvcc snapshots are in use and to use
> a method of cleanup that allows multiple pages to be cleaned up otherwise.

I agree, but don't think that it matters all that much.

If you have an MVCC snapshot, that doesn't mean that TID recycle
safety problems automatically go away. It only means that you have one
known and supported alternative approach to dealing with such
problems. It's not like you just get that for free, just by using an
MVCC snapshot, though -- it has downsides. Downsides such as the
current _bt_killitems() behavior with a concurrently-modified leaf
page (modified when we didn't hold a leaf page pin). It'll just give
up on setting any LP_DEAD bits due to noticing that the leaf page's
LSN changed. (Plus there are implementation restrictions that I won't
repeat again now.)
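The give-up behavior described above amounts to an LSN comparison at kill time; roughly (a simplified sketch, not the real _bt_killitems signature):

```python
def try_kill_items(page_lsn_now, lsn_at_read, dead_offsets, set_lp_dead):
    """If the leaf page's LSN moved while we held no pin, the page may
    have been concurrently modified (split, deduplicated, vacuumed), so
    give up entirely rather than risk setting LP_DEAD on item pointers
    that no longer mean what we recorded. Returns how many bits were set."""
    if page_lsn_now != lsn_at_read:
        return 0  # page changed under us: no LP_DEAD bits get set
    for off in dead_offsets:
        set_lp_dead(off)
    return len(dead_offsets)
```

The cost is exactly the downside mentioned: any concurrent modification, however harmless, forfeits all the accumulated kill work for that page.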

When I refer to the buffer pin interlock, I'm mostly referring to the
general need for something like that in the context of index scans.
Principally in order to make kill_prior_tuple continue to work in
something more or less like its current form.

> However, I don't think we would necessarily have to relax the IAM pinning
> rules, just to be able to do prefetching of more than one index leaf
> page.

To be clear, we already do relax the IAM pinning rules. Or at least
nbtree selectively opts out, as I've gone into already.

> Restricting prefetching to entries within a single leaf page obviously
> has the disadvantage of not being able to benefit from concurrent IO whenever
> crossing a leaf page boundary, but at the same time processing entries from
> just two leaf pages would often allow for a sufficiently aggressive
> prefetching.  Pinning a small number of leaf pages instead of a single leaf
> page shouldn't be a problem.

You're probably right. I just don't see any need to solve that problem in v1.

> One argument for loosening the tight coupling between kill_prior_tuples and
> index scan progress is that the lack of kill_prior_tuples for bitmap scans is
> quite problematic. I've seen numerous production issues with bitmap scans
> caused by subsequent scans processing a growing set of dead tuples, where
> plain index scans were substantially slower initially but didn't get much
> slower over time.

I've seen production issues like that too. No doubt it's a problem.

> We might be able to design a system where the bitmap
> contains a certain number of back-references to the index, allowing later
> cleanup if there weren't any page splits or such.

That does seem possible, but do you really want a design for index
prefetching that relies on that massive enhancement (a total redesign
of kill_prior_tuple) happening at some point in the not-too-distant
future? Seems risky, from a project management point of view.

This back-references idea seems rather complicated, especially if it
needs to work with very large bitmap index scans. Since you'll still
have the basic problem of TID recycle safety to deal with (even with
an MVCC snapshot), you don't just have to revisit the leaf pages. You
also have to revisit the corresponding heap pages (generally they'll
be a lot more numerous than leaf pages). You'll have traded one
problem for another (which is not to say that it's not a good
trade-off).

Right now the executor uses a amgettuple interface, and knows nothing
about index related costs (e.g., pages accessed in any index, index
qual costs). While the index AM has some limited understanding of heap
access costs. So the index AM kinda knows a small bit about both types
of costs (possibly not enough, but something). That informs the
language I'm using to describe all this.

To do something like your "back-references to the index" thing well, I
think that you need more dynamic behavior around when you visit the
heap to get heap tuples pointed to by TIDs from index pages (i.e.
dynamic behavior that determines how many leaf pages to go before
going to the heap to get pointed-to TIDs). That is basically what I
meant by "put the index AM in

Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote:
> This property of index scans is fundamental to how index scans work.
> Pinning an index page as an interlock against concurrently TID
> recycling by VACUUM is directly described by the index API docs [1],
> even (the docs actually use terms like "buffer pin" rather than
> something more abstract sounding). I don't think that anything
> affecting that behavior should be considered an implementation detail
> of the nbtree index AM as such (nor any particular index AM).

Given that the interlock is only needed for non-mvcc scans, that non-mvcc
scans are rare due to catalog accesses using snapshots these days and that
most non-mvcc scans do single-tuple lookups, it might be viable to be more
restrictive about prefetching iff non-mvcc snapshots are in use and to use
a method of cleanup that allows multiple pages to be cleaned up otherwise.

However, I don't think we would necessarily have to relax the IAM pinning
rules, just to be able to do prefetching of more than one index leaf
page. Restricting prefetching to entries within a single leaf page obviously
has the disadvantage of not being able to benefit from concurrent IO whenever
crossing a leaf page boundary, but at the same time processing entries from
just two leaf pages would often allow for a sufficiently aggressive
prefetching.  Pinning a small number of leaf pages instead of a single leaf
page shouldn't be a problem.


One argument for loosening the tight coupling between kill_prior_tuples and
index scan progress is that the lack of kill_prior_tuples for bitmap scans is
quite problematic. I've seen numerous production issues with bitmap scans
caused by subsequent scans processing a growing set of dead tuples, where
plain index scans were substantially slower initially but didn't get much
slower over time.  We might be able to design a system where the bitmap
contains a certain number of back-references to the index, allowing later
cleanup if there weren't any page splits or such.



> I think that it makes sense to put the index AM in control here --
> that almost follows from what I said about the index AM API. The index
> AM already needs to be in control, in about the same way, to deal with
> kill_prior_tuple (plus it helps with the  LIMIT issue I described).

Depending on what "control" means I'm doubtful:

Imo there are decisions influencing prefetching that an index AM shouldn't
need to know about directly, e.g. how the plan shape influences how many
tuples are actually going to be consumed. Of course that determination could
be made in planner/executor and handed to IAMs, for the IAM to then "control"
the prefetching.

Another aspect is that *long* term I think we want to be able to execute
different parts of the plan tree when one part is blocked for IO. Of course
that's not always possible. But particularly with partitioned queries it often
is.  Depending on the form of "control" that's harder if IAMs are in control,
because control flow needs to return to the executor to be able to switch to a
different node, so we can't wait for IO inside the AM.

There probably are ways IAMs could be in "control" that would be compatible
with such constraints however.

Greetings,

Andres Freund




Re: index prefetching

2024-02-14 Thread Andres Freund
Hi,

On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote:
> > > The LIMIT problem is not very clear to me either. Yes, if we get close
> > > to the end of the leaf page, we may need to visit the next leaf page.
> > > But that's kinda the whole point of prefetching - reading stuff ahead,
> > > and reading too far ahead is an inherent risk. Isn't that a problem we
> > > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > > the impact.
> >
> > Right now, the index AM doesn't know anything about LIMIT at all. That
> > doesn't matter, since the index AM can only read/scan one full leaf
> > page before returning control back to the executor proper. The
> > executor proper can just shut down the whole index scan upon finding
> > that we've already returned N tuples for a LIMIT N.
> >
> > We don't do prefetching right now, but we also don't risk reading a
> > leaf page that'll just never be needed. Those two things are in
> > tension, but I don't think that that's quite the same thing as the
> > usual standard prefetching tension/problem. Here there is uncertainty
> > about whether what we're prefetching will *ever* be required -- not
> > uncertainty about when exactly it'll be required. (Perhaps this
> > distinction doesn't mean much to you. I'm just telling you how I think
> > about it, in case it helps move the discussion forward.)
>
> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I'm not sure that that's really true. I think the more common and more
problematic case for partially executing a sub-tree of a query are nested
loops (worse because that happens many times within a query). Particularly for
anti-joins prefetching too aggressively could lead to a significant IO
amplification.

At the same time it's IMO more important to ramp up prefetching distance
fairly aggressively for index scans than it is for sequential scans. For
sequential scans it's quite likely that either the whole scan takes quite a
while (thus slowly ramping doesn't affect overall time that much) or that the
data is cached anyway because the tables are small and frequently used (in
which case we don't need to ramp). And even if smaller tables aren't cached,
the IOs are cheaper because they're sequential.
Contrast that to index scans, where it's much more likely that you have cache
misses in queries that do an overall fairly small number of IOs and where that
IO is largely random.
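An aggressive ramp of the kind argued for here could look like the following (an illustrative heuristic, not code from any patch):

```python
def ramp_distance(prev, max_distance, aggressive=True):
    """Prefetch-distance ramp-up. Doubling reaches the target distance in
    O(log n) steps, which matters for index scans that issue few, largely
    random IOs; a +1 linear ramp would leave much of a short scan
    under-prefetched before the distance becomes useful."""
    step = prev * 2 if aggressive else prev + 1
    return min(max(step, 1), max_distance)
```

A scan starting at distance 1 with a cap of 32 reaches the cap after five doublings, i.e. after ~31 tuples, whereas the linear ramp needs 31 steps just to get there.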

I think we'll need some awareness at ExecInitNode() time about how the results
of the nodes are used. I see a few "classes":

1) All rows are needed, because the node is below an Agg, Hash, Materialize,
   Sort, etc. Can be determined purely by the plan shape.

2) All rows are needed, because the node is completely consumed by the
   top-level (i.e. no limit, anti-joins or such in between) and the top-level
   wants to run the whole query. Unfortunately I don't think we know this at
   plan time at the moment (it's just determined by what's passed to
   ExecutorRun()).

3) Some rows are needed, but it's hard to know the precise number. E.g. because
   of a LIMIT further up.

4) Only a single row is going to be needed, albeit possibly after filtering on
   the node level. E.g. the anti-join case.


There are different times at which we could determine how each node is
consumed:

a) Determine node consumption "class" purely within ExecInit*, via different
   eflags.

   Today that couldn't deal with 2), but I think it'd not be too hard to modify
   callers that consume query results completely to tell that to
   ExecutorStart(), not just ExecutorRun().

   A disadvantage would be that this prevents us from taking IO depth into
   account during costing. There very well might be plans that are cheaper
   than others because the plan shape allows more concurrent IO.


b) Determine node consumption class at plan time.

   This also couldn't deal with 2), but fixing that probably would be harder,
   because we'll often not know at plan time how the query will be
   executed. And in fact the same plan might be executed multiple ways, in
   case of prepared statements.

   The obvious advantage is of course that we can influence the choice of
   paths.


I suspect we'd eventually want a mix of both. Plan time to be able to
influence plan shape, ExecInit* to deal with not knowing how the query will be
consumed at plan time.  Which suggests that we could start with whichever is
easier and extend later.
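
Option a) could be as small as a single eflags bit that complete-consumption callers set before ExecutorStart(). A minimal sketch, with an invented flag name (EXEC_FLAG_CONSUME_ALL is not an existing PostgreSQL flag, and the bit value is arbitrary):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical eflags bit: set by callers that will consume the query
 * result completely (covers class 2 above).
 */
#define EXEC_FLAG_CONSUME_ALL   0x0100

/*
 * During ExecInit*, a node could consult the flag to decide whether it
 * may assume all of its output rows will be needed.
 */
static bool
plan_fully_consumed(int eflags)
{
    return (eflags & EXEC_FLAG_CONSUME_ALL) != 0;
}
```

As noted above, the cost of this approach is that the information arrives after costing, so it can shape execution (IO depth) but not plan choice.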


Greetings,

Andres Freund




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman
 wrote:
> So, a pin on the index leaf page is sufficient to keep line pointers
> from being reused? If we stick to prefetching heap blocks referred to
> by index tuples in a single index leaf page, and we keep that page
> pinned, will we still have a problem?

That's certainly one way of dealing with it. Obviously, there are
questions about how you do that in a way that consistently avoids
creating new problems.

> I don't think that the LIMIT problem is too different for index scans
> than heap scans. We will need some advice from planner to come down to
> prevent over-eager prefetching in all cases.

I think that I'd rather use information at execution time instead, if
at all possible (perhaps in addition to a hint given by the planner).
But it seems a bit premature to discuss this problem now, except to
say that it might indeed be a problem.

> > It's certainly possible that you could figure out various workarounds
> > for each of these issues (plus the kill_prior_tuple issue) with a
> > prefetching design that "de-synchronizes" the index access and the
> > heap access. But it might well be better to extend the existing design
> > in a way that just avoids all these problems in the first place. Maybe
> > "de-synchronization" really can pay for itself (because the benefits
> > will outweigh these costs), but if you go that way then I'd really
> > prefer it that way.
>
> Forcing each index access to be synchronous and interleaved with each
> table access seems like an unprincipled design constraint. While it is
> true that we rely on that in our current implementation (when using
> non-MVCC snapshots), it doesn't seem like a principle inherent to
> accessing indexes and tables.

There is nothing sacred about the way plain index scans work right now
-- especially the part about buffer pins as an interlock.

If the pin thing really was sacred, then we could never have allowed
nbtree to selectively opt-out in cases where it's possible to provide
an equivalent correctness guarantee without holding onto buffer pins,
which, as I went into, is how it actually works in nbtree's
_bt_killitems() today (see commit 2ed5b87f96 for full details). And so
in principle I have no problem with the idea of revising the basic
definition of plain index scans -- especially if it's to make the
definition more abstract, without fundamentally changing it (e.g., to
make it no longer reference buffer pins, making life easier for
prefetching, while at the same time still implying the same underlying
guarantees sufficient to allow nbtree to mostly work the same way as
today).

All I'm really saying is:

1. The sort of tricks that we can do in nbtree's _bt_killitems() are
quite useful, and ought to be preserved in something like their
current form, even when prefetching is in use.

This seems to push things in the direction of centralizing control of
the process in index scan code. For example, it has to understand that
_bt_killitems() will be called at some regular cadence that is well
defined and sensible from an index point of view.

2. Are you sure that the leaf-page-at-a-time thing is such a huge
hindrance to effective prefetching?

I suppose that it might be much more important than I imagine it is
right now, but it'd be nice to have something a bit more concrete to
go on.

3. Even if it is somewhat important, do you really need to get that
part working in v1?

Tomas' original prototype worked with the leaf-page-at-a-time thing,
and that still seemed like a big improvement to me. While being less
invasive, in effect. If we can agree that something like that
represents a useful step in the right direction (not an evolutionary
dead end), then we can make good incremental progress within a single
release.

> I don't think the fact that it would also be valuable to do index
> prefetching is a reason not to do prefetching of heap pages. And,
> while it is true that were you to add index interior or leaf page
> prefetching, it would impact the heap prefetching, at the end of the
> day, the table AM needs some TIDs or TID-equivalents whose blocks
> it can go fetch.

I wasn't really thinking of index page prefetching at all. Just the
cost of applying index quals to read leaf pages that might never
actually need to be read, due to the presence of a LIMIT. That is kind
of a new problem created by eagerly reading (without actually
prefetching) leaf pages.

> You could argue that my suggestion to have the index AM manage and
> populate a queue of TIDs for use by the table AM puts the index AM in
> control. I do think having so many members of the IndexScanDescriptor
> which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
> interplay between fetching an index tuple and fetching a heap tuple is
> confusing and error prone.

But that's k

Re: index prefetching

2024-02-14 Thread Melanie Plageman
> make up a significant fraction of the overall cost of the index scan.
> Especially with an expensive index qual. So if you just assume that
> the TIDs returned by the index scan are the only thing that matters,
> you might have a model that's basically correct on average, but is
> occasionally very wrong. That's one reason for "putting the index AM
> in control".

I don't think the fact that it would also be valuable to do index
prefetching is a reason not to do prefetching of heap pages. And,
while it is true that were you to add index interior or leaf page
prefetching, it would impact the heap prefetching, at the end of the
day, the table AM needs some TIDs or TID-equivalents whose blocks
it can go fetch. The index AM has to produce something that the table
AM will consume. So, if we add prefetching of heap pages and get the
table AM input right, it shouldn't require a full redesign to add
index page prefetching later.

You could argue that my suggestion to have the index AM manage and
populate a queue of TIDs for use by the table AM puts the index AM in
control. I do think having so many members of the IndexScanDescriptor
which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
interplay between fetching an index tuple and fetching a heap tuple is
confusing and error prone.

> As I said back in June, we should probably be marrying information
> from the index scan with information from the heap. This is something
> that is arguably a modularity violation. But it might just be that you
> really do need to take information from both places to consistently
> make the right trade-off.

Agreed that we are going to need to mix information from both places.

> If you take a look at _bt_killitems(), you'll see that it actually has
> two fairly different strategies for avoiding TID recycling race
> condition issues, applied in each of two different cases:
>
> 1. Cases where we really have held onto a buffer pin, per the index AM
> API -- the "index AM orthodox" approach. (The aforementioned issue
> with unlogged indexes exists because with an unlogged index we must
> use approach 1, per the nbtree README section [1]).
>
> 2. Cases where we drop the pin as an optimization (also per [1]), and
> now have to detect the possibility of concurrent modifications by
> VACUUM (that could have led to concurrent TID recycling). We
> conservatively do nothing (don't mark any index tuples LP_DEAD),
> unless the LSN is exactly the same as it was back when the page was
> scanned/read by _bt_readpage().

Re 2: so the LSN could have been changed by some other process (i.e.
not vacuum), so how often in practice is the LSN actually the same as
when the page was scanned/read? Do you think we would catch a
meaningful number of kill prior tuple opportunities if we used an LSN
tracking method like this? Something that let us drop the pin on the
page would obviously be better.
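
The LSN-tracking strategy being discussed (case 2 of _bt_killitems()) can be reduced to a very small invariant: remember the page LSN when the page is read, and only apply LP_DEAD hints later if the LSN is unchanged, since any modification to the page advances its LSN. A standalone sketch of that invariant, with stand-in types (the real code works with Buffer, Page, and BTScanPosData):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

/* What the scan remembers about a leaf page it has read but not pinned. */
typedef struct PagePos
{
    XLogRecPtr  lsn_at_read;    /* page LSN when the page was scanned */
} PagePos;

static bool
safe_to_kill_items(const PagePos *pos, XLogRecPtr page_lsn_now)
{
    /*
     * Equality proves nothing on the page changed since we read it, so the
     * item offsets we decided to kill still refer to the same tuples.
     * Otherwise conservatively do nothing: we cannot tell whether the
     * change was VACUUM recycling TIDs or something harmless.
     */
    return page_lsn_now == pos->lsn_at_read;
}
```

Melanie's question above is exactly about the false-negative rate of this check: any write to the page, not just VACUUM, makes the comparison fail.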

- Melanie




Re: index prefetching

2024-02-14 Thread Melanie Plageman
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
 wrote:
>
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
>  wrote:
> >
> > On 2/7/24 22:48, Melanie Plageman wrote:
> > > ...
> > > - switching scan directions
> > >
> > > If the index scan switches directions on a given invocation of
> > > IndexNext(), heap blocks may have already been prefetched and read for
> > > blocks containing tuples beyond the point at which we want to switch
> > > directions.
> > >
> > > We could fix this by having some kind of streaming read "reset"
> > > callback to drop all of the buffers which have been prefetched which
> > > are now no longer needed. We'd have to go backwards from the last TID
> > > which was yielded to the caller and figure out which buffers in the
> > > pgsr buffer ranges are associated with all of the TIDs which were
> > > prefetched after that TID. The TIDs are in the per_buffer_data
> > > associated with each buffer in pgsr. The issue would be searching
> > > through those efficiently.
> > >
> >
> > Yeah, that's roughly what I envisioned in one of my previous messages
> > about this issue - walking back the TIDs read from the index and added
> > to the prefetch queue.
> >
> > > The other issue is that the streaming read API does not currently
> > > support backwards scans. So, if we switch to a backwards scan from a
> > > forwards scan, we would need to fallback to the non streaming read
> > > method. We could do this by just setting the TID queue size to 1
> > > (which is what I have currently implemented). Or we could add
> > > backwards scan support to the streaming read API.
> > >
> >
> > What do you mean by "support for backwards scans" in the streaming read
> > API? I imagined it naively as
> >
> > 1) drop all requests in the streaming read API queue
> >
> > 2) walk back all "future" requests in the TID queue
> >
> > 3) start prefetching as if from scratch
> >
> > Maybe there's a way to optimize this and reuse some of the work more
> > efficiently, but my assumption is that the scan direction does not
> > change very often, and that we process many items in between.
>
> Yes, the steps you mention for resetting the queues make sense. What I
> meant by "backwards scan is not supported by the streaming read API"
> is that Thomas/Andres had mentioned that the streaming read API does
> not support backwards scans right now. Though, since the callback just
> returns a block number, I don't know how it would break.
>
> When switching between a forwards and backwards scan, does it go
> backwards from the current position or start at the end (or beginning)
> of the relation?

Okay, well I answered this question for myself, by, um, trying it :).
FETCH backward will go backwards from the current cursor position. So,
I don't see exactly why this would be an issue.

> If it is the former, then the blocks would most
> likely be in shared buffers -- which the streaming read API handles.
> It is not obvious to me from looking at the code what the gap is, so
> perhaps Thomas could weigh in.

I have the same problem with the sequential scan streaming read user,
so I am going to try and figure this backwards scan and switching scan
direction thing there (where we don't have other issues).
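
The reset steps Tomas listed (drop queued requests, walk back "future" TIDs, restart prefetching) amount to simple bookkeeping on the queue. A toy model, assuming the prefetch queue is just an array of block numbers with a consumed-up-to index (the real streaming read API keeps considerably more state; all names here are invented):

```c
#include <assert.h>

typedef struct PrefetchQueue
{
    int     blocks[64];     /* heap block numbers, in scan order */
    int     nqueued;        /* total entries added so far */
    int     nconsumed;      /* entries already returned to the scan */
} PrefetchQueue;

/*
 * On a scan-direction change: forget everything not yet consumed.
 * Steps 1+2 from the discussion collapse into truncating the queue;
 * step 3 is the caller refilling it in the new direction.
 */
static void
prefetch_queue_reset(PrefetchQueue *q)
{
    q->nqueued = q->nconsumed;
}
```

The expensive part in practice is not this truncation but re-issuing prefetches for blocks whose reads were already in flight, which is why the assumption that direction changes are rare matters.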

- Melanie




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman
 wrote:
> I wasn't quite sure how we could use
> index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
> suggestion. But, I'd like to understand.

The point I was trying to make with that example was: a highly generic
mechanism can sometimes work across disparate index AMs (that all at
least support plain index scans) when it just so happens that these
AMs don't actually differ in a way that could possibly matter to that
mechanism. While it's true that (say) nbtree and hash are very
different at a high level, it's nevertheless also true that the way
things work at the level of individual index pages is much more
similar than different.

With index deletion, we know that the differences between each
supported index AM either don't matter at all (which is what obviates
the need for index_compute_xid_horizon_for_tuples() to be directly
aware of which index AM the page it is passed comes from), or matter
only in small, incidental ways (e.g., nbtree stores posting lists in
its tuples, despite using IndexTuple structs).

With prefetching, it seems reasonable to suppose that an index-AM
specific approach would end up needing very little truly custom code.
This is pretty strongly suggested by the fact that the rules around
buffer pins (as an interlock against concurrent TID recycling by
VACUUM) are standardized by the index AM API itself. Those rules might
be slightly more natural with nbtree, but that's kinda beside the
point. While the basic organizing principle for where each index tuple
goes can vary enormously, it doesn't necessarily matter at all -- in
the end, you're really just reading each index page (that has TIDs to
read) exactly once per scan, in some fixed order, with interlaced
inline heap accesses (that go fetch heap tuples for each individual
TID read from each index page).

In general I don't accept that we need to do things outside the index
AM, because software architecture encapsulation something something. I
suspect that we'll need to share some limited information across
different layers of abstraction, because that's just fundamentally
what's required by the constraints we're operating under. Can't really
prove it, though.

-- 
Peter Geoghegan




Re: index prefetching

2024-02-14 Thread Peter Geoghegan
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
 wrote:
> > Another thing that argues against doing this is that we might not need
> > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > We could end up scanning a whole extra leaf page (including all of its
> > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > (that's not what happens right now, but it isn't really needed at all
> > right now).
> >
>
> I'm not quite sure I understand what is "this" that you argue against.
> Are you saying we should not separate the two scans? If yes, is there a
> better way to do this?

What I'm concerned about is the difficulty and complexity of any
design that requires revising "63.4. Index Locking Considerations",
since that's pretty subtle stuff. In particular, if prefetching
"de-synchronizes" (to use your term) the index leaf page level scan
and the heap page scan, then we'll probably have to totally revise the
basic API.

Maybe that'll actually turn out to be the right thing to do -- it
could just be the only thing that can unleash the full potential of
prefetching. But I'm not aware of any evidence that points in that
direction. Are you? (I might have just missed it.)

> The LIMIT problem is not very clear to me either. Yes, if we get close
> to the end of the leaf page, we may need to visit the next leaf page.
> But that's kinda the whole point of prefetching - reading stuff ahead,
> and reading too far ahead is an inherent risk. Isn't that a problem we
> have even without LIMIT? The prefetch distance ramp up is meant to limit
> the impact.

Right now, the index AM doesn't know anything about LIMIT at all. That
doesn't matter, since the index AM can only read/scan one full leaf
page before returning control back to the executor proper. The
executor proper can just shut down the whole index scan upon finding
that we've already returned N tuples for a LIMIT N.

We don't do prefetching right now, but we also don't risk reading a
leaf page that'll just never be needed. Those two things are in
tension, but I don't think that that's quite the same thing as the
usual standard prefetching tension/problem. Here there is uncertainty
about whether what we're prefetching will *ever* be required -- not
uncertainty about when exactly it'll be required. (Perhaps this
distinction doesn't mean much to you. I'm just telling you how I think
about it, in case it helps move the discussion forward.)

> > This property of index scans is fundamental to how index scans work.
> > Pinning an index page as an interlock against concurrent TID
> > recycling by VACUUM is directly described by the index API docs [1],
> > even (the docs actually use terms like "buffer pin" rather than
> > something more abstract sounding). I don't think that anything
> > affecting that behavior should be considered an implementation detail
> > of the nbtree index AM as such (nor any particular index AM).
> >
>
> Good point.

The main reason why the index AM docs require this interlock is
because we need such an interlock to make non-MVCC snapshot scans
safe. If you remove the interlock (the buffer pin interlock that
protects against TID recycling by VACUUM), you can still avoid the
same race condition by using an MVCC snapshot. This is why using an
MVCC snapshot is a requirement for bitmap index scans. I believe that
it's also a requirement for index-only scans, but the index AM docs
don't spell that out.

Another factor that complicates things here is mark/restore
processing. The design for that has the idea of processing one page at
a time baked-in. Kinda like with the kill_prior_tuple issue.

It's certainly possible that you could figure out various workarounds
for each of these issues (plus the kill_prior_tuple issue) with a
prefetching design that "de-synchronizes" the index access and the
heap access. But it might well be better to extend the existing design
in a way that just avoids all these problems in the first place. Maybe
"de-synchronization" really can pay for itself (because the benefits
will outweigh these costs), but if you go that way then I'd really
prefer it that way.

> > I think that it makes sense to put the index AM in control here --
> > that almost follows from what I said about the index AM API. The index
> > AM already needs to be in control, in about the same way, to deal with
> > kill_prior_tuple (plus it helps with the LIMIT issue I described).
> >
>
> In control how? What would be the control flow - what part would be
> managed by the index AM?

ISTM that prefetching for an index scan is about the index scan
itself, first and foremost. The heap accesses are usually the dominant
cost, of course, but sometimes the index leaf page accesses really do
make up a significant fraction of the overall cost of the index scan.
Especially with an expensive index qual. So if you just assume that
the TIDs returned by the index scan are the only thing that matters,
you might have a model that's basically correct on average, but is
occasionally very wrong. That's one reason for "putting the index AM
in control".

Re: index prefetching

2024-02-14 Thread Melanie Plageman
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
 wrote:
>
> On 2/7/24 22:48, Melanie Plageman wrote:
> > ...
> > Issues
> > ---
> > - kill prior tuple
> >
> > This optimization doesn't work with index prefetching with the current
> > design. Kill prior tuple relies on alternating between fetching a
> > single index tuple and visiting the heap. After visiting the heap we
> > can potentially kill the immediately preceding index tuple. Once we
> > fetch multiple index tuples, enqueue their TIDs, and later visit the
> > heap, the next index page we visit may not contain all of the index
> > tuples deemed killable by our visit to the heap.
> >
>
> I admit I haven't thought about kill_prior_tuple until you pointed out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(
>
> > In our case, we could try and fix this by prefetching only heap blocks
> > referred to by index tuples on the same index page. Or we could try
> > and keep a pool of index pages pinned and go back and kill index
> > tuples on those pages.
> >
>
> I think restricting the prefetching to a single index page would not be
> a huge issue performance-wise - that's what the initial patch version
> (implemented at the index AM level) did, pretty much. The prefetch queue
> would get drained as we approach the end of the index page, but luckily
> index pages tend to have a lot of entries. But it'd put an upper bound
> on the prefetch distance (much lower than the e_i_c maximum 1000, but
> I'd say common values are 10-100 anyway).
>
> But how would we know we're on the same index page? That knowledge is
> not available outside the index AM - the executor or indexam.c does not
> know this, right? Presumably we could expose this, somehow, but it seems
> like a violation of the abstraction ...

The easiest way to do this would be to have the index AM amgettuple()
functions set a new member in the IndexScanDescData which is either
the index page identifier or a boolean that indicates we have moved on
to the next page. Then, when filling the queue, we would stop doing so
when the page switches. Now, this wouldn't really work for the first
index tuple on each new page, so, perhaps we would need the index AMs
to implement some kind of "peek" functionality.

Or, we could provide the index AM with a max queue size and allow it
to fill up the queue with the TIDs it wants (which it could keep to
the same index page). And, for the index-only scan case, could have
some kind of flag which indicates if the caller is putting
TIDs+HeapTuples or TIDs+IndexTuples on the queue, which might reduce
the amount of space we need. I'm not sure who manages the memory here.
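
The "keep the queue to the same index page" variant can be sketched as a fill routine that the index AM could expose: it copies TIDs into a caller-supplied queue but refuses to cross a leaf-page boundary, so every queued TID remains covered by the currently pinned page. All types and names below are invented for illustration:

```c
#include <assert.h>

typedef struct QueuedTid
{
    int     leaf_page;      /* which index leaf page produced this TID */
    int     heap_block;     /* heap block to prefetch */
} QueuedTid;

/*
 * Fill dst from src, stopping at max_queue entries or at the first TID
 * that comes from a different leaf page than the first one.  Returns the
 * number of entries queued.
 */
static int
fill_queue_same_page(const QueuedTid *src, int nsrc,
                     QueuedTid *dst, int max_queue)
{
    int     n = 0;

    while (n < nsrc && n < max_queue &&
           src[n].leaf_page == src[0].leaf_page)
    {
        dst[n] = src[n];
        n++;
    }
    return n;
}
```

This is where the "peek" problem shows up: the routine can only notice the page change by looking at a TID from the *next* page, which the AM must then be able to hand back on the following call.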

I wasn't quite sure how we could use
index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's
suggestion. But, I'd like to understand.

> > - switching scan directions
> >
> > If the index scan switches directions on a given invocation of
> > IndexNext(), heap blocks may have already been prefetched and read for
> > blocks containing tuples beyond the point at which we want to switch
> > directions.
> >
> > We could fix this by having some kind of streaming read "reset"
> > callback to drop all of the buffers which have been prefetched which
> > are now no longer needed. We'd have to go backwards from the last TID
> > which was yielded to the caller and figure out which buffers in the
> > pgsr buffer ranges are associated with all of the TIDs which were
> > prefetched after that TID. The TIDs are in the per_buffer_data
> > associated with each buffer in pgsr. The issue would be searching
> > through those efficiently.
> >
>
> Yeah, that's roughly what I envisioned in one of my previous messages
> about this issue - walking back the TIDs read from the index and added
> to the prefetch queue.
>
> > The other issue is that the streaming read API does not currently
> > support backwards scans. So, if we switch to a backwards scan from a
> > forwards scan, we would need to fallback to the non streaming read
> > method. We could do this by just setting the TID queue size to 1
> > (which is what I have currently implemented). Or we could add
> > backwards scan support to the streaming read API.
> >
>
> What do you mean by "support for backwards scans" in the streaming read
> API? I imagined it naively as
>
> 1) drop all requests in the streaming read API queue
>
> 2) walk back all "future" requests in the TID queue
>
> 3) start prefetching as if from scratch
>
> Maybe there's a way to optimize this and reuse some of the work more
> efficiently, but my assumption is that the scan direction does not
> change very often, and that we process many items in between.

Re: index prefetching

2024-02-14 Thread Tomas Vondra



On 2/14/24 08:10, Robert Haas wrote:
> On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
>  wrote:
>> - kill prior tuple
>>
>> This optimization doesn't work with index prefetching with the current
>> design. Kill prior tuple relies on alternating between fetching a
>> single index tuple and visiting the heap. After visiting the heap we
>> can potentially kill the immediately preceding index tuple. Once we
>> fetch multiple index tuples, enqueue their TIDs, and later visit the
>> heap, the next index page we visit may not contain all of the index
>> tuples deemed killable by our visit to the heap.
> 
> Is this maybe just a bookkeeping problem? A Boolean that says "you can
> kill the prior tuple" is well-suited if and only if the prior tuple is
> well-defined. But perhaps it could be replaced with something more
> sophisticated that tells you which tuples are eligible to be killed.
> 

I don't think it's just a bookkeeping problem. In a way, nbtree already
does keep an array of tuples to kill (see btgettuple), but it's always
for the current index page. So it's not that we immediately go and kill
the prior tuple - nbtree already stashes it in an array, and kills all
those tuples when moving to the next index page.

The way I understand the problem is that with prefetching we're bound to
determine the kill_prior_tuple flag with a delay, in which case we might
have already moved to the next index page ...


So to make this work, we'd need to:

1) keep index pages pinned for all "in flight" TIDs (read from the
index, not yet consumed by the index scan)

2) keep a separate array of "to be killed" index tuples for each page

3) have a more sophisticated way to decide when to kill tuples and unpin
the index page (instead of just doing it when moving to the next index page)

Maybe that's what you meant by "more sophisticated bookkeeping", ofc.
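
The three requirements above suggest per-page bookkeeping along these lines: each pinned leaf page carries its own "to be killed" list plus a count of in-flight TIDs, and the page is flushed and unpinned only once that count drops to zero. A minimal sketch, with invented names (a real version would hold actual buffer pins and OffsetNumbers):

```c
#include <assert.h>

#define MAX_KILLS 32

typedef struct PinnedPageState
{
    int     page;               /* the pinned index page (step 1) */
    int     tids_in_flight;     /* TIDs read from it, not yet consumed */
    int     killed[MAX_KILLS];  /* offsets deemed killable (step 2) */
    int     nkilled;
} PinnedPageState;

/* A heap visit found the tuple dead: remember the index offset. */
static void
record_kill(PinnedPageState *p, int offset)
{
    if (p->nkilled < MAX_KILLS)
        p->killed[p->nkilled++] = offset;
}

/*
 * A TID from this page was consumed by the scan.  Returns 1 when no TIDs
 * remain in flight, i.e. the kill list can be applied and the page
 * unpinned (step 3).
 */
static int
consume_tid(PinnedPageState *p)
{
    return (--p->tids_in_flight == 0);
}
```

The open cost question is step 1: how many such pages can be pinned at once before the prefetch distance is effectively capped by pin budget rather than by effective_io_concurrency.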


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-14 Thread Tomas Vondra
On 2/13/24 20:54, Peter Geoghegan wrote:
> On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
>  wrote:
>> On 2/7/24 22:48, Melanie Plageman wrote:
>> I admit I haven't thought about kill_prior_tuple until you pointed out.
>> Yeah, prefetching separates (de-synchronizes) the two scans (index and
>> heap) in a way that prevents this optimization. Or at least makes it
>> much more complex :-(
> 
> Another thing that argues against doing this is that we might not need
> to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> We could end up scanning a whole extra leaf page (including all of its
> tuples) for want of the ability to "push down" a LIMIT to the index AM
> (that's not what happens right now, but it isn't really needed at all
> right now).
> 

I'm not quite sure I understand what is "this" that you argue against.
Are you saying we should not separate the two scans? If yes, is there a
better way to do this?

The LIMIT problem is not very clear to me either. Yes, if we get close
to the end of the leaf page, we may need to visit the next leaf page.
But that's kinda the whole point of prefetching - reading stuff ahead,
and reading too far ahead is an inherent risk. Isn't that a problem we
have even without LIMIT? The prefetch distance ramp up is meant to limit
the impact.
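
The ramp-up being referred to can be sketched as a small step function, similar in spirit to how bitmap heap scans grow their prefetch target: a scan that stops early (e.g. under a LIMIT) has wasted at most a few speculative reads, while a long scan reaches full depth after a handful of steps. The exact growth schedule below is illustrative, not the committed behavior:

```c
#include <assert.h>

/*
 * Advance the prefetch target one step toward the maximum: start at 1,
 * then roughly double until full depth is reached.
 */
static int
ramp_up_target(int target, int maximum)
{
    if (target >= maximum)
        return maximum;         /* already at full depth */
    if (target >= maximum / 2)
        return maximum;         /* final doubling would overshoot; snap */
    if (target > 0)
        return target * 2;      /* geometric ramp-up */
    return 1;                   /* first speculative read */
}
```

With maximum = 16 the targets go 0, 1, 2, 4, 8, 16: a LIMIT that is satisfied within the first few rows never triggers deep prefetching.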

> This property of index scans is fundamental to how index scans work.
> Pinning an index page as an interlock against concurrently TID
> recycling by VACUUM is directly described by the index API docs [1],
> even (the docs actually use terms like "buffer pin" rather than
> something more abstract sounding). I don't think that anything
> affecting that behavior should be considered an implementation detail
> of the nbtree index AM as such (nor any particular index AM).
> 

Good point.

> I think that it makes sense to put the index AM in control here --
> that almost follows from what I said about the index AM API. The index
> AM already needs to be in control, in about the same way, to deal with
> kill_prior_tuple (plus it helps with the LIMIT issue I described).
> 

In control how? What would be the control flow - what part would be
managed by the index AM?

I initially did the prefetching entirely in each index AM, but it was
suggested doing this in the executor would be better. So I gradually
moved it to executor. But the idea to combine this with the streaming
read API seems as a move from executor back to the lower levels ... and
now you're suggesting to make the index AM responsible for this again.

I'm not saying any of those layering options is wrong, but it's not
clear to me which is the right one.

> There doesn't necessarily need to be much code duplication to make
> that work. Offhand I suspect it would be kind of similar to how
> deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets
> by with generic logic implemented by
> index_compute_xid_horizon_for_tuples -- that's all that we need to
> determine a snapshotConflictHorizon value for recovery conflict
> purposes. Note that index_compute_xid_horizon_for_tuples() reads
> *index* pages, despite not being aware of the caller's index AM and
> index tuple format.
> 
> (The only reason why nbtree needs a custom solution is because it has
> posting list tuples to worry about, unlike GiST and unlike Hash, which
> consistently use unadorned generic IndexTuple structs with heap TID
> represented in the standard/generic way only. While these concepts
> probably all originated in nbtree, they're still not nbtree
> implementation details.)
> 

I haven't looked at the details, but I agree the LP_DEAD deletion seems
like a sensible inspiration.

>>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
>>> there is an easier way to fix this, as I don't think the mvcc test
>>> failed on Tomas' version.
>>>
>>
>> I kinda doubt it worked correctly, considering I simply ignored the
>> optimization. It's far more likely it just worked by luck.
> 
> The test that did fail will have only revealed that the
kill_prior_tuple wasn't operating as expected -- which isn't the same
> thing as giving wrong answers.
> 

Possible. But AFAIK it did fail for Melanie, and I don't have a very
good explanation for the difference in behavior.

> Note that there are various ways that concurrent TID recycling might
> prevent _bt_killitems() from setting LP_DEAD bits. It's totally
> unsurprising that breaking kill_prior_tuple in some way could be
> missed. Andres wrote the MVCC test in question precisely because
> certain aspects of kill_prior_tuple were broken for months without
> anybody noticing.
> 
> [1] https://www.postgresql.org/docs/devel/index-locking.html

Yeah. There's clearly plenty of space for subtle issues.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-02-13 Thread Robert Haas
On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman
 wrote:
> - kill prior tuple
>
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.

Is this maybe just a bookkeeping problem? A Boolean that says "you can
kill the prior tuple" is well-suited if and only if the prior tuple is
well-defined. But perhaps it could be replaced with something more
sophisticated that tells you which tuples are eligible to be killed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-02-13 Thread Peter Geoghegan
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra
 wrote:
> On 2/7/24 22:48, Melanie Plageman wrote:
> I admit I haven't thought about kill_prior_tuple until you pointed out.
> Yeah, prefetching separates (de-synchronizes) the two scans (index and
> heap) in a way that prevents this optimization. Or at least makes it
> much more complex :-(

Another thing that argues against doing this is that we might not need
to visit any more B-Tree leaf pages when there is a LIMIT n involved.
We could end up scanning a whole extra leaf page (including all of its
tuples) for want of the ability to "push down" a LIMIT to the index AM
(that's not what happens right now, but it isn't really needed at all
right now).

This property of index scans is fundamental to how index scans work.
Pinning an index page as an interlock against concurrently TID
recycling by VACUUM is directly described by the index API docs [1],
even (the docs actually use terms like "buffer pin" rather than
something more abstract sounding). I don't think that anything
affecting that behavior should be considered an implementation detail
of the nbtree index AM as such (nor any particular index AM).

I think that it makes sense to put the index AM in control here --
that almost follows from what I said about the index AM API. The index
AM already needs to be in control, in about the same way, to deal with
kill_prior_tuple (plus it helps with the LIMIT issue I described).

There doesn't necessarily need to be much code duplication to make
that work. Offhand I suspect it would be kind of similar to how
deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets
by with generic logic implemented by
index_compute_xid_horizon_for_tuples -- that's all that we need to
determine a snapshotConflictHorizon value for recovery conflict
purposes. Note that index_compute_xid_horizon_for_tuples() reads
*index* pages, despite not being aware of the caller's index AM and
index tuple format.

(The only reason why nbtree needs a custom solution is because it has
posting list tuples to worry about, unlike GiST and unlike Hash, which
consistently use unadorned generic IndexTuple structs with heap TID
represented in the standard/generic way only. While these concepts
probably all originated in nbtree, they're still not nbtree
implementation details.)

> > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> > there is an easier way to fix this, as I don't think the mvcc test
> > failed on Tomas' version.
> >
>
> I kinda doubt it worked correctly, considering I simply ignored the
> optimization. It's far more likely it just worked by luck.

The test that did fail will have only revealed that the
kill_prior_tuple wasn't operating as expected -- which isn't the same
thing as giving wrong answers.

Note that there are various ways that concurrent TID recycling might
prevent _bt_killitems() from setting LP_DEAD bits. It's totally
unsurprising that breaking kill_prior_tuple in some way could be
missed. Andres wrote the MVCC test in question precisely because
certain aspects of kill_prior_tuple were broken for months without
anybody noticing.

[1] https://www.postgresql.org/docs/devel/index-locking.html
-- 
Peter Geoghegan




Re: index prefetching

2024-02-13 Thread Tomas Vondra
On 2/7/24 22:48, Melanie Plageman wrote:
> ...
> 
> Attached is a patch which implements a real queue and fixes some of
> the issues with the previous version. It doesn't pass tests yet and
> has issues. Some are bugs in my implementation I need to fix. Some are
> issues we would need to solve in the streaming read API. Some are
> issues with index prefetching generally.
> 
> Note that these two patches have to be applied before 21d9c3ee4e
> because Thomas hasn't released a rebased version of the streaming read
> API patches yet.
> 

Thanks for working on this, and for investigating the various issues.

> Issues
> ---
> - kill prior tuple
> 
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.
> 

I admit I haven't thought about kill_prior_tuple until you pointed it out.
Yeah, prefetching separates (de-synchronizes) the two scans (index and
heap) in a way that prevents this optimization. Or at least makes it
much more complex :-(

> In our case, we could try and fix this by prefetching only heap blocks
> referred to by index tuples on the same index page. Or we could try
> and keep a pool of index pages pinned and go back and kill index
> tuples on those pages.
> 

I think restricting the prefetching to a single index page would not be
a huge issue performance-wise - that's what the initial patch version
(implemented at the index AM level) did, pretty much. The prefetch queue
would get drained as we approach the end of the index page, but luckily
index pages tend to have a lot of entries. But it'd put an upper bound
on the prefetch distance (much lower than the e_i_c maximum 1000, but
I'd say common values are 10-100 anyway).

But how would we know we're on the same index page? That knowledge is
not available outside the index AM - the executor or indexam.c does not
know this, right? Presumably we could expose this, somehow, but it seems
like a violation of the abstraction ...

The same thing affects keeping multiple index pages pinned, for TIDs
that are yet to be used by the index scan. We'd need to know when to
release a pinned page, once we're done with processing all items.

FWIW I haven't tried implementing any of this, so maybe I'm missing
something and it can be made to work in a nice way.


> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> there is an easier way to fix this, as I don't think the mvcc test
> failed on Tomas' version.
> 

I kinda doubt it worked correctly, considering I simply ignored the
optimization. It's far more likely it just worked by luck.


> - switching scan directions
> 
> If the index scan switches directions on a given invocation of
> IndexNext(), heap blocks may have already been prefetched and read for
> blocks containing tuples beyond the point at which we want to switch
> directions.
> 
> We could fix this by having some kind of streaming read "reset"
> callback to drop all of the buffers which have been prefetched which
> are now no longer needed. We'd have to go backwards from the last TID
> which was yielded to the caller and figure out which buffers in the
> pgsr buffer ranges are associated with all of the TIDs which were
> prefetched after that TID. The TIDs are in the per_buffer_data
> associated with each buffer in pgsr. The issue would be searching
> through those efficiently.
> 

Yeah, that's roughly what I envisioned in one of my previous messages
about this issue - walking back the TIDs read from the index and added
to the prefetch queue.

> The other issue is that the streaming read API does not currently
> support backwards scans. So, if we switch to a backwards scan from a
> forwards scan, we would need to fallback to the non streaming read
> method. We could do this by just setting the TID queue size to 1
> (which is what I have currently implemented). Or we could add
> backwards scan support to the streaming read API.
> 

What do you mean by "support for backwards scans" in the streaming read
API? I imagined it naively as

1) drop all requests in the streaming read API queue

2) walk back all "future" requests in the TID queue

3) start prefetching as if from scratch

Maybe there's a way to optimize this and reuse some of the work more
efficiently, but my assumption is that the scan direction does not
change very often, and that we process many items in between.
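For illustration only, the reset in steps 1-2 could be pictured with a toy queue like the one below (hypothetical names, nothing here is actual PostgreSQL or streaming read API code): on a direction change, everything enqueued but not yet consumed is simply discarded before refilling in the new direction.

```c
#include <stdint.h>

/* Toy model of resetting the TID queue on a scan-direction change.
 * Hypothetical names, far simpler than any real implementation. */
#define QUEUE_SIZE 64

typedef struct TidQueue
{
	uint64_t	items[QUEUE_SIZE];
	int			head;				/* next slot to consume */
	int			tail;				/* next slot to fill */
} TidQueue;

static int
tid_queue_enqueue(TidQueue *q, uint64_t tid)
{
	if (q->tail - q->head >= QUEUE_SIZE)
		return 0;					/* full */
	q->items[q->tail++ % QUEUE_SIZE] = tid;
	return 1;
}

static int
tid_queue_dequeue(TidQueue *q, uint64_t *tid)
{
	if (q->head == q->tail)
		return 0;					/* empty */
	*tid = q->items[q->head++ % QUEUE_SIZE];
	return 1;
}

/* Steps 1+2 of the reset: drop all "future" (not yet consumed) requests. */
static void
tid_queue_reset(TidQueue *q)
{
	q->tail = q->head;
}
```

Step 3 (prefetching as if from scratch) is then just refilling the emptied queue from the index scan in the new direction.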


> - mark and restore
> 
> Similar to the issue

Re: index prefetching

2024-02-07 Thread Melanie Plageman
On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman
 wrote:
>
> On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
>  wrote:
> >
> > On 1/24/24 01:51, Melanie Plageman wrote:
> > >> But I'm not sure what to do about optimizations that are more specific
> > >> to the access path. Consider for example the index-only scans. We don't
> > >> want to prefetch all the pages, we need to inspect the VM and prefetch
> > >> just the not-all-visible ones. And then pass the info to the index scan,
> > >> so that it does not need to check the VM again. It's not clear to me how
> > >> to do this with this approach.
> > >
> > > Yea, this is an issue I'll need to think about. To really spell out
> > > the problem: the callback dequeues a TID from the tid_queue and looks
> > > up its block in the VM. It's all visible. So, it shouldn't return that
> > > block to the streaming read API to fetch from the heap because it
> > > doesn't need to be read. But, where does the callback put the TID so
> > > that the caller can get it? I'm going to think more about this.
> > >
> >
> > Yes, that's the problem for index-only scans. I'd generalize it so that
> > it's about the callback being able to (a) decide if it needs to read the
> > heap page, and (b) store some custom info for the TID.
>
> Actually, I think this is no big deal. See attached. I just don't
> enqueue tids whose blocks are all visible. I had to switch the order
> from fetch heap then fill queue to fill queue then fetch heap.
>
> While doing this I noticed some wrong results in the regression tests
> (like in the alter table test), so I suspect I have some kind of
> control flow issue. Perhaps I should fix the resource leak so I can
> actually see the failing tests :)

Attached is a patch which implements a real queue and fixes some of
the issues with the previous version. It doesn't pass tests yet and
has issues. Some are bugs in my implementation I need to fix. Some are
issues we would need to solve in the streaming read API. Some are
issues with index prefetching generally.

Note that these two patches have to be applied before 21d9c3ee4e
because Thomas hasn't released a rebased version of the streaming read
API patches yet.

Issues
---
- kill prior tuple

This optimization doesn't work with index prefetching with the current
design. Kill prior tuple relies on alternating between fetching a
single index tuple and visiting the heap. After visiting the heap we
can potentially kill the immediately preceding index tuple. Once we
fetch multiple index tuples, enqueue their TIDs, and later visit the
heap, the next index page we visit may not contain all of the index
tuples deemed killable by our visit to the heap.

In our case, we could try and fix this by prefetching only heap blocks
referred to by index tuples on the same index page. Or we could try
and keep a pool of index pages pinned and go back and kill index
tuples on those pages.

Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
there is an easier way to fix this, as I don't think the mvcc test
failed on Tomas' version.

- switching scan directions

If the index scan switches directions on a given invocation of
IndexNext(), heap blocks may have already been prefetched and read for
blocks containing tuples beyond the point at which we want to switch
directions.

We could fix this by having some kind of streaming read "reset"
callback to drop all of the buffers which have been prefetched which
are now no longer needed. We'd have to go backwards from the last TID
which was yielded to the caller and figure out which buffers in the
pgsr buffer ranges are associated with all of the TIDs which were
prefetched after that TID. The TIDs are in the per_buffer_data
associated with each buffer in pgsr. The issue would be searching
through those efficiently.

The other issue is that the streaming read API does not currently
support backwards scans. So, if we switch to a backwards scan from a
forwards scan, we would need to fallback to the non streaming read
method. We could do this by just setting the TID queue size to 1
(which is what I have currently implemented). Or we could add
backwards scan support to the streaming read API.

- mark and restore

Similar to the issue with switching the scan direction, mark and
restore requires us to reset the TID queue and streaming read queue.
For now, I've hacked in something to the PlannerInfo and Plan to set
the TID queue size to 1 for plans containing a merge join (yikes).

- multiple executions

For reasons I don't entirely understand yet, multiple executions (not
rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
patch, I have disabled prefetching (and made the TID queue size 1)
when execute_once is false.

- Index Only Scans nee

Re: index prefetching

2024-01-25 Thread Tomas Vondra



On 1/25/24 11:45, Dilip Kumar wrote:
> On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra
>  wrote:
> 
>> On 1/22/24 07:35, Konstantin Knizhnik wrote:
>>>
>>> On 22/01/2024 1:47 am, Tomas Vondra wrote:
>>>> Ah, right. Well, you're right in this case we perhaps could set just one
>>>> of those flags, but the "purpose" of the two places is quite different.
>>>>
>>>> The "prefetch" flag is fully controlled by the prefetcher, and it's up
>>>> to it to change it (e.g. I can easily imagine some new logic touching
>>>> setting it to "false" for some reason).
>>>>
>>>> The "data" flag is fully controlled by the custom callbacks, so whatever
>>>> the callback stores, will be there.
>>>>
>>>> I don't think it's worth simplifying this. In particular, I don't think
>>>> the callback can assume it can rely on the "prefetch" flag.

>>> Why not add an "all_visible" flag to IndexPrefetchEntry? It will not
>>> cause any extra space overhead (because of alignment), but allows us to
>>> avoid dynamic memory allocation (not sure if it is critical, but nice to
>>> avoid if possible).
>>>
>>
> While reading through the first patch I got some questions, I haven't
> read it complete yet but this is what I got so far.
> 
> 1.
> +static bool
> +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
> +{
> + int idx;
> ...
> + if (prefetch->blockItems[idx] != (block - i))
> + return false;
> +
> + /* Don't prefetch if the block happens to be the same. */
> + if (prefetch->blockItems[idx] == block)
> + return false;
> + }
> +
> + /* not sequential, not recently prefetched */
> + return true;
> +}
> 
> The above function name is BlockIsSequential but at the end, it
> returns true if it is not sequential, seem like a problem?

Actually, I think it's the comment that's wrong - the last return is
reached only for a sequential pattern (and when the block was not
accessed recently).

> Also other 2 checks right above the end of the function are returning
> false if the block is the same or the pattern is sequential I think
> those are wrong too.
> 

Hmmm. You're right this is partially wrong. There are two checks:

/*
 * For a sequential pattern, blocks "k" step ago needs to have block
 * number by "k" smaller compared to the current block.
 */
if (prefetch->blockItems[idx] != (block - i))
return false;

/* Don't prefetch if the block happens to be the same. */
if (prefetch->blockItems[idx] == block)
return false;

The first condition is correct - we want to return "false" when the
pattern is not sequential.

But the second condition is wrong - we want to skip prefetching when the
block was already prefetched recently, so this should return true (which
is a bit misleading, as it seems to imply the pattern is sequential,
when it's not).

However, this is harmless, because we then identify the block as
recently prefetched in the "full" cache check, so we won't prefetch it
anyway - it's just a bit more expensive.

There's another inefficiency - we stop looking for the same block once
we find the first block breaking the non-sequential pattern. Imagine a
sequence of blocks 1, 2, 3, 1, 2, 3, ... in which case we never notice
the block was recently prefetched, because we always find the break of
the sequential pattern. But again, it's harmless, thanks to the full
cache of recently prefetched blocks.
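To make the fixed behavior concrete, here is a standalone sketch (not the actual patch code, and simplified) of a check that keeps scanning the whole history for a duplicate even after the sequential pattern is broken:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_HISTORY 4

/* Sketch only, not the patch's IndexPrefetchBlockIsSequential().
 * hist[len - 1] is the most recently requested block. */
static bool
should_prefetch(const uint32_t *hist, int len, uint32_t block)
{
	bool		sequential = (len >= SEQ_HISTORY);

	for (int i = 1; i <= SEQ_HISTORY && i <= len; i++)
	{
		uint32_t	prev = hist[len - i];

		/* Recently prefetched: skip.  This check is independent of the
		 * sequential-pattern check, so we don't stop at the first break. */
		if (prev == block)
			return false;

		/* For a sequential pattern, the block "i" steps back must have a
		 * block number exactly "i" smaller than the current one. */
		if (prev != block - i)
			sequential = false;
	}

	/* Sequential runs are left to kernel read-ahead. */
	return !sequential;
}
```

With this structure the 1, 2, 3, 1, 2, 3, ... pattern is caught by the duplicate check even though the sequential test fails on the first iteration.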

>  2.
>  I have noticed that the prefetch history is maintained at the backend
> level, but what if multiple backends are trying to fetch the same heap
> blocks maybe scanning the same index, so should that be in some shared
> structure?  I haven't thought much deeper about this from the
> implementation POV, but should we think about it, or it doesn't
> matter?

Yes, the cache is at the backend level - it's a known limitation, but I
see it more as a conscious tradeoff.

Firstly, while the LRU cache is at backend level, PrefetchBuffer also
checks shared buffers for each prefetch request. So with sufficiently
large shared buffers we're likely to find it there (and for direct I/O
there won't be any other place to check).

Secondly, the only other place to check is page cache, but there's no
good (sufficiently cheap) way to check that. See the preadv2/nowait
experiment earlier in this thread.

I suppose we could implement a similar LRU cache for shared memory (and
I don't think it'd be very complicated), but I did not plan to do that
in this patch unless absolutely necessary.
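For illustration, a backend-local cache of recently prefetched blocks can be as simple as the tiny LRU array below (hypothetical names; the actual patch shards this into multiple small LRUs, and a shared-memory variant would additionally need locking):

```c
#include <stdbool.h>
#include <stdint.h>

#define LRU_SLOTS 8

/* Illustrative sketch, not the patch's data structure. */
typedef struct PrefetchCache
{
	uint32_t	block[LRU_SLOTS];
	uint64_t	stamp[LRU_SLOTS];	/* 0 = empty slot */
	uint64_t	clock;
} PrefetchCache;

/* Returns true if "block" was prefetched recently; otherwise records it,
 * evicting the least recently used entry. */
static bool
cache_lookup_insert(PrefetchCache *c, uint32_t block)
{
	int			victim = 0;

	for (int i = 0; i < LRU_SLOTS; i++)
	{
		if (c->stamp[i] != 0 && c->block[i] == block)
		{
			c->stamp[i] = ++c->clock;	/* refresh recency */
			return true;
		}
		if (c->stamp[i] < c->stamp[victim])
			victim = i;
	}
	c->block[victim] = block;
	c->stamp[victim] = ++c->clock;
	return false;
}
```

A shared-memory version of the same idea would mostly differ in partitioning and locking, which is why keeping it per-backend is the cheap tradeoff.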


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-25 Thread Dilip Kumar
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra
 wrote:

> On 1/22/24 07:35, Konstantin Knizhnik wrote:
> >
> > On 22/01/2024 1:47 am, Tomas Vondra wrote:
> >> Ah, right. Well, you're right in this case we perhaps could set just one
> >> of those flags, but the "purpose" of the two places is quite different.
> >>
> >> The "prefetch" flag is fully controlled by the prefetcher, and it's up
> >> to it to change it (e.g. I can easily imagine some new logic touching
> >> setting it to "false" for some reason).
> >>
> >> The "data" flag is fully controlled by the custom callbacks, so whatever
> >> the callback stores, will be there.
> >>
> >> I don't think it's worth simplifying this. In particular, I don't think
> >> the callback can assume it can rely on the "prefetch" flag.
> >>
> > Why not add an "all_visible" flag to IndexPrefetchEntry? It will not
> > cause any extra space overhead (because of alignment), but allows us to
> > avoid dynamic memory allocation (not sure if it is critical, but nice to
> > avoid if possible).
> >
>
While reading through the first patch I got some questions, I haven't
read it complete yet but this is what I got so far.

1.
+static bool
+IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block)
+{
+ int idx;
...
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+
+ /* Don't prefetch if the block happens to be the same. */
+ if (prefetch->blockItems[idx] == block)
+ return false;
+ }
+
+ /* not sequential, not recently prefetched */
+ return true;
+}

The above function name is BlockIsSequential but at the end, it
returns true if it is not sequential, seem like a problem?
Also other 2 checks right above the end of the function are returning
false if the block is the same or the pattern is sequential I think
those are wrong too.


 2.
 I have noticed that the prefetch history is maintained at the backend
level, but what if multiple backends are trying to fetch the same heap
blocks maybe scanning the same index, so should that be in some shared
structure?  I haven't thought much deeper about this from the
implementation POV, but should we think about it, or it doesn't
matter?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-24 Thread Melanie Plageman
On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra
 wrote:
>
> On 1/24/24 01:51, Melanie Plageman wrote:
>
> >>> There are also table AM layering violations in my sketch which would
> >>> have to be worked out (not to mention some resource leakage I didn't
> >>> bother investigating [which causes it to fail tests]).
> >>>
> >>> 0001 is all of Thomas' streaming read API code that isn't yet in
> >>> master and 0002 is my rough sketch of index prefetching using the
> >>> streaming read API
> >>>
> >>> There are also numerous optimizations that your index prefetching
> >>> patch set does that would need to be added in some way. I haven't
> >>> thought much about it yet. I wanted to see what you thought of this
> >>> approach first. Basically, is it workable?
> >>
> >> It seems workable, yes. I'm not sure it's much simpler than my patch
> >> (considering a lot of the code is in the optimizations, which are
> >> missing from this patch).
> >>
> >> I think the question is where should the optimizations happen. I suppose
> >> some of them might/should happen in the StreamingRead API itself - like
> >> the detection of sequential patterns, recently prefetched blocks, ...
> >
> > So, the streaming read API does detection of sequential patterns and
> > not prefetching things that are in shared buffers. It doesn't handle
> > avoiding prefetching recently prefetched blocks yet AFAIK. But I
> > daresay this would be relevant for other streaming read users and
> > could certainly be implemented there.
> >
>
> Yes, the "recently prefetched stuff" cache seems like a fairly natural
> complement to the pattern detection and shared-buffers check.
>
> FWIW I wonder if we should make some of this customizable, so that
> systems with customized storage (e.g. neon or with direct I/O) can e.g.
> disable some of these checks. Or replace them with their version.

That's a promising idea.

> >> But I'm not sure what to do about optimizations that are more specific
> >> to the access path. Consider for example the index-only scans. We don't
> >> want to prefetch all the pages, we need to inspect the VM and prefetch
> >> just the not-all-visible ones. And then pass the info to the index scan,
> >> so that it does not need to check the VM again. It's not clear to me how
> >> to do this with this approach.
> >
> > Yea, this is an issue I'll need to think about. To really spell out
> > the problem: the callback dequeues a TID from the tid_queue and looks
> > up its block in the VM. It's all visible. So, it shouldn't return that
> > block to the streaming read API to fetch from the heap because it
> > doesn't need to be read. But, where does the callback put the TID so
> > that the caller can get it? I'm going to think more about this.
> >
>
> Yes, that's the problem for index-only scans. I'd generalize it so that
> it's about the callback being able to (a) decide if it needs to read the
> heap page, and (b) store some custom info for the TID.

Actually, I think this is no big deal. See attached. I just don't
enqueue tids whose blocks are all visible. I had to switch the order
from fetch heap then fill queue to fill queue then fetch heap.

While doing this I noticed some wrong results in the regression tests
(like in the alter table test), so I suspect I have some kind of
control flow issue. Perhaps I should fix the resource leak so I can
actually see the failing tests :)

As for your a) and b) above.

Regarding a): We discussed allowing speculative prefetching and
separating the logic for prefetching from actually reading blocks (so
you can prefetch blocks you ultimately don't read). We decided this
may not belong in a streaming read API. What do you think?

Regarding b): We can store per buffer data for anything that actually
goes down through the streaming read API, but, in the index only case,
we don't want the streaming read API to know about blocks that it
doesn't actually need to read.
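To picture the split described above, here is a toy callback (hypothetical types; the real streaming read API and visibility map look nothing like this): TIDs whose heap block is all-visible never reach the read stream, since the index-only scan can answer them from the index tuple alone.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

/* Toy stand-ins for the real structures; purely illustrative. */
typedef struct ToyScan
{
	uint32_t	tid_blocks[32];		/* heap blocks of queued TIDs */
	bool		all_visible[32];	/* toy visibility-map lookup result */
	int			next;
	int			count;
} ToyScan;

/* Streaming-read-style callback: return the next heap block that
 * actually needs to be read, skipping all-visible ones. */
static uint32_t
next_block_cb(ToyScan *scan)
{
	while (scan->next < scan->count)
	{
		int			i = scan->next++;

		if (!scan->all_visible[i])
			return scan->tid_blocks[i];
		/* all-visible: the IOS result comes from the index tuple alone */
	}
	return INVALID_BLOCK;			/* queue drained, not necessarily done */
}
```

This also shows why per-buffer data alone isn't enough for IOS: the all-visible TIDs have to be handed back to the caller through some other channel, since no buffer is ever read for them.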

- Melanie
From f6cb591ba520351ab7f0e7cbf9d6df3dacda6b44 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v3 1/2] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c  |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c|   8 +-
 src/backend/postmaster/checkpointer.c|  15 +-
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++
 src/backend/storage/buffer/bufmgr.c  | 560

Re: index prefetching

2024-01-24 Thread Tomas Vondra


On 1/22/24 07:35, Konstantin Knizhnik wrote:
> 
> On 22/01/2024 1:47 am, Tomas Vondra wrote:
>> Ah, right. Well, you're right in this case we perhaps could set just one
>> of those flags, but the "purpose" of the two places is quite different.
>>
>> The "prefetch" flag is fully controlled by the prefetcher, and it's up
>> to it to change it (e.g. I can easily imagine some new logic touching
>> setting it to "false" for some reason).
>>
>> The "data" flag is fully controlled by the custom callbacks, so whatever
>> the callback stores, will be there.
>>
>> I don't think it's worth simplifying this. In particular, I don't think
>> the callback can assume it can rely on the "prefetch" flag.
>>
> Why not add an "all_visible" flag to IndexPrefetchEntry? It will not
> cause any extra space overhead (because of alignment), but allows us to
> avoid dynamic memory allocation (not sure if it is critical, but nice to
> avoid if possible).
> 

Because it's specific to index-only scans, while IndexPrefetchEntry is a
generic thing, for all places.

However:

(1) Melanie actually presented a very different way to implement this,
relying on the StreamingRead API. So chances are this struct won't
actually be used.

(2) After going through Melanie's patch, I realized this is actually
broken. The IOS case needs to keep more stuff, not just the all-visible
flag, but also the index tuple. Otherwise it'll just operate on the last
tuple read from the index, which happens to be in xs_ituple. Attached is
a patch with a trivial fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From 5ac954b8d13fbb9419204195c15cc594870e9702 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240124 1/2] Prefetch heap pages during index scans

Index scans are a significant source of random I/O on the indexed heap,
but can't benefit from kernel read-ahead. For bitmap scans that is not
an issue, because they do prefetch explicitly, but for plain index scans
this is a major bottleneck - reading page at a time does not allow
saturating modern storage systems.

This enhances index scans (including index-only scans) to prefetch heap
pages. The scan maintains a queue of future TIDs received from the index,
prefetches the associated heap pages, and eventually passes each TID to
the caller.

To eliminate unnecessary prefetches, a small cache of recent prefetches
is maintained, and duplicate prefetches are skipped. Furthermore,
sequential access patterns are detected and not prefetched, on the
assumption that kernel read-ahead will handle them more efficiently.

These optimizations are best-effort heuristics - we don't know if the
kernel will actually prefetch the pages on its own, and we can't easily
check that. Moreover, different kernels (and kernel versions) may behave
differently.

Note: For shared buffers we can easily check if a page is cached, and
the PrefetchBuffer() function already takes care of that. These
optimizations are primarily about the page cache.

The prefetching is also disabled for plans that may be executed more than
once - such plans may change scan direction, interfering with the prefetch
queue. Consider scrollable cursors with backwards scans. This might get
improved to allow the prefetcher to handle direction changes, but it's
not clear if it's worth it.

Note: If a plan changes the scan direction, that inherently wastes the
issued prefetches. If the direction changes often, it likely means a lot
of the pages are still cached. Similarly, if a plan pauses for a long
time, the already prefetched pages may get evicted.
---
 src/backend/commands/explain.c   |  18 +
 src/backend/executor/Makefile|   1 +
 src/backend/executor/execMain.c  |  12 +
 src/backend/executor/execPrefetch.c  | 884 +++
 src/backend/executor/instrument.c|   4 +
 src/backend/executor/nodeIndexonlyscan.c | 113 ++-
 src/backend/executor/nodeIndexscan.c |  68 +-
 src/include/executor/executor.h  |  52 ++
 src/include/executor/instrument.h|   2 +
 src/include/nodes/execnodes.h|  10 +
 src/tools/pgindent/typedefs.list |   3 +
 11 files changed, 1162 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/executor/execPrefetch.c

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 843472e6ddc..b6631cdb2de 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3568,6 +3568,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 		!INSTR_TIME_IS_ZERO(usage->local_blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 	   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
   has_local || has_temp ||
   

Re: index prefetching

2024-01-24 Thread Tomas Vondra
On 1/22/24 08:21, Konstantin Knizhnik wrote:
> 
> On 22/01/2024 1:39 am, Tomas Vondra wrote:
>>> Why we can prefer covering index  to compound index? I see only two good
>>> reasons:
>>> 1. Extra columns type do not  have comparison function need for AM.
>>> 2. The extra columns are never used in query predicate.
>>>
>> Or maybe you don't want to include the columns in a UNIQUE constraint?
>>
> Do you mean that compound index (a,b) can not be used to enforce
> uniqueness of "a"?
> If so, I agree.
> 

Yes.

>>> If you are going to use this columns in query predicates I do not see
>>> much sense in creating inclusive index rather than compound index.
>>> Do you?
>>>
>> But this is also about conditions that can't be translated into index
>> scan keys. Consider this:
>>
>> create table t (a int, b int, c int);
>> insert into t select 1000 * random(), 1000 * random(), 1000 * random()
>> from generate_series(1,100) s(i);
>> create index on t (a,b);
>> vacuum analyze t;
>>
>> explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
>> 111;
>>     QUERY PLAN
>>
>> -
>>   Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
>> (actual time=4.562..4.564 rows=0 loops=1)
>>     Index Cond: (a = 10)
>>     Filter: (mod(b, 10) = 111)
>>     Rows Removed by Filter: 974
>>     Buffers: shared hit=980
>>     Prefetches: blocks=901
>>   Planning Time: 0.304 ms
>>   Execution Time: 5.146 ms
>> (8 rows)
>>
>> Notice that this still fetched ~1000 buffers in order to evaluate the
>> filter on "b", because it's complex and can't be transformed into a nice
>> scan key.
> 
> O yes.
> Looks like I didn't understand the logic of when a predicate is included
> in the index condition and when it is not.
> It seems to be natural that only such predicate which specifies some
> range can be included in index condition.
> But it is not the case:
> 
> postgres=# explain select * from t where a = 10 and b in (10,20,30);
>  QUERY PLAN
> -
>  Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
>    Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
> (2 rows)
> 
> So I thought an ANY predicate using index keys is included in the index condition.
> But it is not true (as your example shows).
> 
> But IMHO mod(b,10)=11 or (b+1) < 100 are both quite rare predicates
> this is why I named these use cases "exotic".

Not sure I agree with describing this as "exotic".

The same thing applies to an arbitrary function call. And those are
pretty common in conditions - date_part/date_trunc. Arithmetic
expressions are not that uncommon either. Also, users sometimes have
conditions comparing multiple keys (a < b).
> In any case, if we have some columns in the index tuple it is desirable
> to use them for filtering before extracting the heap tuple.
> But I'm afraid it will not be so easy to implement...
> 

I'm not sure what you mean. The patch does that, more or less. There's
issues that need to be solved (e.g. to decide when not to do this), and
how to integrate that into the scan interface (where the quals are
evaluated at the end).

What do you mean when you say "will not be easy to implement"? What
problems do you foresee?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-24 Thread Tomas Vondra
On 1/24/24 01:51, Melanie Plageman wrote:
> On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
>  wrote:
>>
>> On 1/19/24 22:43, Melanie Plageman wrote:
>>
>>> We fill a queue with blocks from TIDs that we fetched from the index.
>>> The queue is saved in a scan descriptor that is made available to the
>>> streaming read callback. Once the queue is full, we invoke the table
>>> AM specific index_fetch_tuple() function which calls
>>> pg_streaming_read_buffer_get_next(). When the streaming read API
>>> invokes the callback we registered, it simply dequeues a block number
>>> for prefetching.
>>
>> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
>> is being filled from IndexNext, and then the queue in StreamingRead.
> 
> I've changed the name from blk_queue to tid_queue to fix the issue you
> mention in your later remarks.
> I suppose there are two queues. The tid_queue is just to pass the
> block requests to the streaming read API. The prefetch distance will
> be the smaller of the two sizes.
> 

FWIW I think the two queues are a nice / elegant approach. In hindsight
my problems with trying to utilize the StreamingRead were due to trying
to use the block-oriented API directly from places that work with TIDs,
and this just makes that go away.

I wonder what the overhead of shuffling stuff between queues will be,
but hopefully not too high (that's my assumption).

>>> The only change to the streaming read API is that now, even if the
>>> callback returns InvalidBlockNumber, we may not be finished, so make
>>> it resumable.
>>
>> Hmm, not sure when can the callback return InvalidBlockNumber before
>> reaching the end. Perhaps for the first index_fetch_heap call? Any
>> reason not to fill the blk_queue before calling index_fetch_heap?
> 
> The callback will return InvalidBlockNumber whenever the queue is
> empty. Let's say your queue size is 5 and your effective prefetch
> distance is 10 (some combination of the PgStreamingReadRange sizes and
> PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
> the callback returns InvalidBlockNumber. Then the tid_queue is filled
> with 5 tids. Then index_fetch_heap() is called.
> pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
> blocks, emptying the queue. Once all 5 have been dequeued, the
> callback will return InvalidBlockNumber.
> pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
> a buffer and save the associated TID in the per_buffer_data. Before
> index_fetch_heap() is called again, we will see that the queue is not
> full and fill it up again with 5 TIDs. So, the callback will return
> InvalidBlockNumber 3 times in this scenario.
> 

Thanks for the explanation. Yes, I didn't realize that the queues may be
of different lengths, at which point it makes sense to return an invalid
block to signal that the TID queue is empty.

>>> Structurally, this changes the timing of when the heap blocks are
>>> prefetched. Your code would get a tid from the index and then prefetch
>>> the heap block -- doing this until it filled a queue that had the
>>> actual tids saved in it. With my approach and the streaming read API,
>>> you fetch tids from the index until you've filled up a queue of block
>>> numbers. Then the streaming read API will prefetch those heap blocks.
>>
>> And is that a good/desirable change? I'm not saying it's not, but maybe
>> we should not be filling either queue in one go - we don't want to
>> overload the prefetching.
> 
> We can focus on the prefetch distance algorithm maintained in the
> streaming read API and then make sure that the tid_queue is larger
> than the desired prefetch distance maintained by the streaming read
> API.
> 

Agreed. I think I wasn't quite right to be concerned about "overloading"
the prefetch, because that depends entirely on the StreamingRead API
queue. A large TID queue can't cause overload of anything.

What could happen is a TID queue being too small, so the prefetch can't
hit the target distance. But that can happen already, e.g. indexes that
are correlated and/or index-only scans with all-visible pages.

>>> There are also table AM layering violations in my sketch which would
>>> have to be worked out (not to mention some resource leakage I didn't
>>> bother investigating [which causes it to fail tests]).
>>>
>>> 0001 is all of Thomas' streaming read API code that isn't yet in
>>> master and 0002 is my rough sketch of index prefetching using the
>>> streaming read API
>>>
>>> There are also numerous optimizations that your index prefetching
>>

Re: index prefetching

2024-01-23 Thread Melanie Plageman
On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra
 wrote:
>
> On 1/19/24 22:43, Melanie Plageman wrote:
>
> > We fill a queue with blocks from TIDs that we fetched from the index.
> > The queue is saved in a scan descriptor that is made available to the
> > streaming read callback. Once the queue is full, we invoke the table
> > AM specific index_fetch_tuple() function which calls
> > pg_streaming_read_buffer_get_next(). When the streaming read API
> > invokes the callback we registered, it simply dequeues a block number
> > for prefetching.
>
> So in a way there are two queues in IndexFetchTableData. One (blk_queue)
> is being filled from IndexNext, and then the queue in StreamingRead.

I've changed the name from blk_queue to tid_queue to fix the issue you
mention in your later remarks.
I suppose there are two queues. The tid_queue is just to pass the
block requests to the streaming read API. The prefetch distance will
be the smaller of the two sizes.

> > The only change to the streaming read API is that now, even if the
> > callback returns InvalidBlockNumber, we may not be finished, so make
> > it resumable.
>
> Hmm, not sure when can the callback return InvalidBlockNumber before
> reaching the end. Perhaps for the first index_fetch_heap call? Any
> reason not to fill the blk_queue before calling index_fetch_heap?

The callback will return InvalidBlockNumber whenever the queue is
empty. Let's say your queue size is 5 and your effective prefetch
distance is 10 (some combination of the PgStreamingReadRange sizes and
PgStreamingRead->max_ios). The first time you call index_fetch_heap(),
the callback returns InvalidBlockNumber. Then the tid_queue is filled
with 5 tids. Then index_fetch_heap() is called.
pg_streaming_read_look_ahead() will prefetch all 5 of these TID's
blocks, emptying the queue. Once all 5 have been dequeued, the
callback will return InvalidBlockNumber.
pg_streaming_read_buffer_get_next() will return one of the 5 blocks in
a buffer and save the associated TID in the per_buffer_data. Before
index_fetch_heap() is called again, we will see that the queue is not
full and fill it up again with 5 TIDs. So, the callback will return
InvalidBlockNumber 3 times in this scenario.

> > Structurally, this changes the timing of when the heap blocks are
> > prefetched. Your code would get a tid from the index and then prefetch
> > the heap block -- doing this until it filled a queue that had the
> > actual tids saved in it. With my approach and the streaming read API,
> > you fetch tids from the index until you've filled up a queue of block
> > numbers. Then the streaming read API will prefetch those heap blocks.
>
> And is that a good/desirable change? I'm not saying it's not, but maybe
> we should not be filling either queue in one go - we don't want to
> overload the prefetching.

We can focus on the prefetch distance algorithm maintained in the
streaming read API and then make sure that the tid_queue is larger
than the desired prefetch distance maintained by the streaming read
API.

> > I didn't actually implement the block queue -- I just saved a single
> > block number and pretended it was a block queue. I was imagining we
> > replace this with something like your IndexPrefetch->blockItems --
> > which has light deduplication. We'd probably have to flesh it out more
> > than that.
>
> I don't understand how this passes the TID to the index_fetch_heap.
> Isn't it working only by accident, due to blk_queue only having a single
> entry? Shouldn't the first queue (blk_queue) store TIDs instead?

Oh dear! Fixed in the attached v2. I've replaced the single
BlockNumber with a single ItemPointerData. I will work on implementing
an actual queue next week.

> > There are also table AM layering violations in my sketch which would
> > have to be worked out (not to mention some resource leakage I didn't
> > bother investigating [which causes it to fail tests]).
> >
> > 0001 is all of Thomas' streaming read API code that isn't yet in
> > master and 0002 is my rough sketch of index prefetching using the
> > streaming read API
> >
> > There are also numerous optimizations that your index prefetching
> > patch set does that would need to be added in some way. I haven't
> > thought much about it yet. I wanted to see what you thought of this
> > approach first. Basically, is it workable?
>
> It seems workable, yes. I'm not sure it's much simpler than my patch
> (considering a lot of the code is in the optimizations, which are
> missing from this patch).
>
> I think the question is where should the optimizations happen. I suppose
> some of them might/should happen in the StreamingRead API itself - like
the detection of sequential patterns, recently prefetched blocks, ...

Re: index prefetching

2024-01-23 Thread Tomas Vondra
On 1/19/24 22:43, Melanie Plageman wrote:
> On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
>  wrote:
>>
>> On 1/9/24 21:31, Robert Haas wrote:
>>> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
>>>  wrote:
>>>> Here's a somewhat reworked version of the patch. My initial goal was to
>>>> see if it could adopt the StreamingRead API proposed in [1], but that
>>>> turned out to be less straight-forward than I hoped, for two reasons:
>>>
>>> I guess we need Thomas or Andres or maybe Melanie to comment on this.
>>>
>>
>> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
>> streaming I/O stuff.
> 
> I've been studying your patch with the intent of finding a way to
> change it and or the streaming read API to work together. I've
> attached a very rough sketch of how I think it could work.
> 

Thanks.

> We fill a queue with blocks from TIDs that we fetched from the index.
> The queue is saved in a scan descriptor that is made available to the
> streaming read callback. Once the queue is full, we invoke the table
> AM specific index_fetch_tuple() function which calls
> pg_streaming_read_buffer_get_next(). When the streaming read API
> invokes the callback we registered, it simply dequeues a block number
> for prefetching.

So in a way there are two queues in IndexFetchTableData. One (blk_queue)
is being filled from IndexNext, and then the queue in StreamingRead.

> The only change to the streaming read API is that now, even if the
> callback returns InvalidBlockNumber, we may not be finished, so make
> it resumable.
> 

Hmm, not sure when the callback can return InvalidBlockNumber before
reaching the end. Perhaps for the first index_fetch_heap call? Any
reason not to fill the blk_queue before calling index_fetch_heap?


> Structurally, this changes the timing of when the heap blocks are
> prefetched. Your code would get a tid from the index and then prefetch
> the heap block -- doing this until it filled a queue that had the
> actual tids saved in it. With my approach and the streaming read API,
> you fetch tids from the index until you've filled up a queue of block
> numbers. Then the streaming read API will prefetch those heap blocks.
> 

And is that a good/desirable change? I'm not saying it's not, but maybe
we should not be filling either queue in one go - we don't want to
overload the prefetching.

> I didn't actually implement the block queue -- I just saved a single
> block number and pretended it was a block queue. I was imagining we
> replace this with something like your IndexPrefetch->blockItems --
> which has light deduplication. We'd probably have to flesh it out more
> than that.
> 

I don't understand how this passes the TID to the index_fetch_heap.
Isn't it working only by accident, due to blk_queue only having a single
entry? Shouldn't the first queue (blk_queue) store TIDs instead?

> There are also table AM layering violations in my sketch which would
> have to be worked out (not to mention some resource leakage I didn't
> bother investigating [which causes it to fail tests]).
> 
> 0001 is all of Thomas' streaming read API code that isn't yet in
> master and 0002 is my rough sketch of index prefetching using the
> streaming read API
> 
> There are also numerous optimizations that your index prefetching
> patch set does that would need to be added in some way. I haven't
> thought much about it yet. I wanted to see what you thought of this
> approach first. Basically, is it workable?
> 

It seems workable, yes. I'm not sure it's much simpler than my patch
(considering a lot of the code is in the optimizations, which are
missing from this patch).

I think the question is where should the optimizations happen. I suppose
some of them might/should happen in the StreamingRead API itself - like
the detection of sequential patterns, recently prefetched blocks, ...

But I'm not sure what to do about optimizations that are more specific
to the access path. Consider for example the index-only scans. We don't
want to prefetch all the pages, we need to inspect the VM and prefetch
just the not-all-visible ones. And then pass the info to the index scan,
so that it does not need to check the VM again. It's not clear to me how
to do this with this approach.


The main


-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-21 Thread Konstantin Knizhnik


On 22/01/2024 1:39 am, Tomas Vondra wrote:

Why would we prefer a covering index to a compound index? I see only two
good reasons:
1. The extra column types do not have the comparison function needed by the AM.
2. The extra columns are never used in query predicates.


Or maybe you don't want to include the columns in a UNIQUE constraint?

Do you mean that a compound index (a,b) cannot be used to enforce 
uniqueness of "a"?

If so, I agree.


If you are going to use these columns in query predicates, I do not see
much sense in creating an inclusive index rather than a compound index.
Do you?


But this is also about conditions that can't be translated into index
scan keys. Consider this:

create table t (a int, b int, c int);
insert into t select 1000 * random(), 1000 * random(), 1000 * random()
from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
111;
                               QUERY PLAN
-------------------------------------------------------------------------
  Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
(actual time=4.562..4.564 rows=0 loops=1)
Index Cond: (a = 10)
Filter: (mod(b, 10) = 111)
Rows Removed by Filter: 974
Buffers: shared hit=980
Prefetches: blocks=901
  Planning Time: 0.304 ms
  Execution Time: 5.146 ms
(8 rows)

Notice that this still fetched ~1000 buffers in order to evaluate the
filter on "b", because it's complex and can't be transformed into a nice
scan key.


O yes.
Looks like I didn't understand the logic of when a predicate is included
in the index condition and when it is not.
It seems natural that only a predicate which specifies some range can be
included in the index condition.

But it is not the case:

postgres=# explain select * from t where a = 10 and b in (10,20,30);
 QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
   Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
(2 rows)

So I thought an ANY predicate using index keys is always included in the
index condition. But that is not true (as your example shows).

But IMHO mod(b,10)=11 or (b+1) < 100 are both quite rare predicates;
this is why I called these use cases "exotic".


In any case, if we have some columns in the index tuple it is desirable to
use them for filtering before extracting the heap tuple.

But I'm afraid it will not be so easy to implement...



Re: index prefetching

2024-01-21 Thread Konstantin Knizhnik


On 22/01/2024 1:47 am, Tomas Vondra wrote:

Ah, right. Well, you're right in this case we perhaps could set just one
of those flags, but the "purpose" of the two places is quite different.

The "prefetch" flag is fully controlled by the prefetcher, and it's up
to it to change it (e.g. I can easily imagine some new logic touching
setting it to "false" for some reason).

The "data" flag is fully controlled by the custom callbacks, so whatever
the callback stores, will be there.

I don't think it's worth simplifying this. In particular, I don't think
the callback can assume it can rely on the "prefetch" flag.

Why not add an "all_visible" flag to IndexPrefetchEntry? It will not 
cause any extra space overhead (because of alignment), but it allows us to 
avoid dynamic memory allocation (not sure if it is critical, but nice to 
avoid if possible).




Re: index prefetching

2024-01-21 Thread Peter Smith
2024-01 Commitfest.

Hi, this patch has a CF status of "Needs Review" [1], but it seems
like there were CFbot test failures last time it was run [2]. Please
have a look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4351/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4351

Kind Regards,
Peter Smith.




Re: index prefetching

2024-01-21 Thread Tomas Vondra



On 1/21/24 20:56, Konstantin Knizhnik wrote:
> 
> On 19/01/2024 2:35 pm, Tomas Vondra wrote:
>>
>> On 1/19/24 09:34, Konstantin Knizhnik wrote:
>>> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
 On 1/17/24 09:45, Konstantin Knizhnik wrote:
> I have integrated your prefetch patch in Neon and it actually works!
> Moreover, I combined it with prefetch of leaf pages for IOS and it
> also
> seems to work.
>
 Cool! And do you think this is the right design/way to do this?
>>> I like the idea of prefetching TIDs in executor.
>>>
>>> But looking though your patch I have some questions:
>>>
>>>
>>> 1. Why it is necessary to allocate and store all_visible flag in data
>>> buffer. Why caller of  IndexPrefetchNext can not look at prefetch field?
>>>
>>> +        /* store the all_visible flag in the private part of the
>>> entry */
>>> +        entry->data = palloc(sizeof(bool));
>>> +        *(bool *) entry->data = all_visible;
>>>
>> What you mean by "prefetch field"?
> 
> 
> I mean "prefetch" field of IndexPrefetchEntry:
> 
> +
> +typedef struct IndexPrefetchEntry
> +{
> +    ItemPointerData tid;
> +
> +    /* should we prefetch heap page for this TID? */
> +    bool        prefetch;
> +
> 
> You store the same flag twice:
> 
> +        /* prefetch only if not all visible */
> +        entry->prefetch = !all_visible;
> +
> +        /* store the all_visible flag in the private part of the entry */
> +        entry->data = palloc(sizeof(bool));
> +        *(bool *) entry->data = all_visible;
> 
> My question was: why do we need to allocate something in entry->data and
> store all_visible in it, while we already stored !all_visible in
> entry->prefetch.
> 

Ah, right. Well, you're right in this case we perhaps could set just one
of those flags, but the "purpose" of the two places is quite different.

The "prefetch" flag is fully controlled by the prefetcher, and it's up
to it to change it (e.g. I can easily imagine some new logic touching
setting it to "false" for some reason).

The "data" flag is fully controlled by the custom callbacks, so whatever
the callback stores, will be there.

I don't think it's worth simplifying this. In particular, I don't think
the callback can assume it can rely on the "prefetch" flag.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-21 Thread Tomas Vondra



On 1/21/24 20:50, Konstantin Knizhnik wrote:
> 
> On 20/01/2024 12:14 am, Tomas Vondra wrote:
>>> Looks like I was wrong: even if it is not an index-only scan, if the
>>> index condition involves only index attributes then the heap is not
>>> accessed until we find a tuple satisfying the search condition.
>>> Inclusive index case described above
>>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
>>> exotic case. If keys are actually used in search, then why not to create
>>> normal compound index instead?
>>>
>> Not sure I follow ...
>>
>> Firstly, I'm not convinced the example addressed by that other patch is
>> that exotic. IMHO it's quite possible it's actually quite common, but
>> the users do no realize the possible gains.
>>
>> Also, there are reasons to not want very wide indexes - they have overhead
>> associated with maintenance, disk space, etc. I think it's perfectly
>> rational to design indexes in a way that eliminates most heap fetches
>> necessary to evaluate conditions, but does not guarantee IOS (so the
>> last heap fetch is still needed).
> 
> We are comparing compound index (a,b) and covering (inclusive) index (a)
> include (b)
> These indexes have exactly the same width and size and almost the same
> maintenance overhead.
> 
> First index has more expensive comparison function (involving two
> columns)  but I do not think that it can significantly affect
> performance and maintenance cost. Also if selectivity of "a" is good
> enough, then there is no need to compare "b"
> 
> Why would we prefer a covering index to a compound index? I see only two
> good reasons:
> 1. The extra column types do not have the comparison function needed by the AM.
> 2. The extra columns are never used in query predicates.
> 

Or maybe you don't want to include the columns in a UNIQUE constraint?

> If you are going to use these columns in query predicates, I do not see
> much sense in creating an inclusive index rather than a compound index.
> Do you?
> 

But this is also about conditions that can't be translated into index
scan keys. Consider this:

create table t (a int, b int, c int);
insert into t select 1000 * random(), 1000 * random(), 1000 * random()
from generate_series(1,1000000) s(i);
create index on t (a,b);
vacuum analyze t;

explain (analyze, buffers) select * from t where a = 10 and mod(b,10) =
111;
   QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12)
(actual time=4.562..4.564 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: (mod(b, 10) = 111)
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.304 ms
 Execution Time: 5.146 ms
(8 rows)

Notice that this still fetched ~1000 buffers in order to evaluate the
filter on "b", because it's complex and can't be transformed into a nice
scan key. Or this:

explain (analyze, buffers) select a from t where a = 10 and (b+1) < 100
 and c < 0;


   QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..3673.22 rows=1 width=4)
(actual time=4.446..4.448 rows=0 loops=1)
   Index Cond: (a = 10)
   Filter: ((c < 0) AND ((b + 1) < 100))
   Rows Removed by Filter: 974
   Buffers: shared hit=980
   Prefetches: blocks=901
 Planning Time: 0.313 ms
 Execution Time: 4.878 ms
(8 rows)

where it's "broken" by the extra unindexed column.

FWIW there are the primary cases I had in mind for this patch.


> 
>> What do you mean by "create normal compound index"? The patch addresses
>> a limitation that not every condition can be translated into a proper
>> scan key. Even if we improve this, there will always be such conditions.
>> The IOS can evaluate them on the index tuple; the regular index scan
>> can't do that (currently).
>>
>> Can you share an example demonstrating the alternative approach?
> 
> May be I missed something.
> 
> This is the example from
> https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me
>  :
> 
> ```
> 
> And here is the plan with index on (a,b).
> 
> Limit  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
>   Output: a, b, d
>   Buffers: shared hit=613
>   ->  Index Scan using t_a_b_idx on public.t  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
>         Output: a, b, d
>         Index Cond: ((t.a > 100) AND (t.b = 4))
>         Buffers: shared hit=613
> Planning:
>   Buffers: shared hit=41
> Planning Time: 0.314 ms
> Execution Time: 6.910 ms ```
> 
> 
> Isn't it an optimal plan for this query?
> 
> And cite from self reproducible example 

Re: index prefetching

2024-01-21 Thread Konstantin Knizhnik



On 19/01/2024 2:35 pm, Tomas Vondra wrote:


On 1/19/24 09:34, Konstantin Knizhnik wrote:

On 18/01/2024 6:00 pm, Tomas Vondra wrote:

On 1/17/24 09:45, Konstantin Knizhnik wrote:

I have integrated your prefetch patch in Neon and it actually works!
Moreover, I combined it with prefetch of leaf pages for IOS and it also
seems to work.


Cool! And do you think this is the right design/way to do this?

I like the idea of prefetching TIDs in executor.

But looking though your patch I have some questions:


1. Why it is necessary to allocate and store all_visible flag in data
buffer. Why caller of  IndexPrefetchNext can not look at prefetch field?

+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;


What you mean by "prefetch field"?



I mean "prefetch" field of IndexPrefetchEntry:

+
+typedef struct IndexPrefetchEntry
+{
+    ItemPointerData tid;
+
+    /* should we prefetch heap page for this TID? */
+    bool        prefetch;
+

You store the same flag twice:

+        /* prefetch only if not all visible */
+        entry->prefetch = !all_visible;
+
+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;

My question was: why do we need to allocate something in entry->data and 
store all_visible in it, while we already stored !all_visible in 
entry->prefetch.







Re: index prefetching

2024-01-21 Thread Konstantin Knizhnik


On 20/01/2024 12:14 am, Tomas Vondra wrote:

Looks like I was wrong: even if it is not an index-only scan, if the index
condition involves only index attributes then the heap is not accessed
until we find a tuple satisfying the search condition.
Inclusive index case described above
(https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
exotic case. If keys are actually used in search, then why not to create
normal compound index instead?


Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is
that exotic. IMHO it's quite possible it's actually quite common, but
the users do not realize the possible gains.

Also, there are reasons to not want very wide indexes - they have overhead
associated with maintenance, disk space, etc. I think it's perfectly
rational to design indexes in a way that eliminates most heap fetches
necessary to evaluate conditions, but does not guarantee IOS (so the
last heap fetch is still needed).


We are comparing a compound index (a,b) and a covering (inclusive) index 
(a) include (b).
These indexes have exactly the same width and size and almost the same 
maintenance overhead.


The first index has a more expensive comparison function (involving two 
columns), but I do not think that it can significantly affect
performance and maintenance cost. Also, if the selectivity of "a" is good 
enough, then there is no need to compare "b".


Why would we prefer a covering index to a compound index? I see only two 
good reasons:

1. The extra column types do not have the comparison function needed by the AM.
2. The extra columns are never used in query predicates.

If you are going to use these columns in query predicates, I do not see 
much sense in creating an inclusive index rather than a compound index.

Do you?



What do you mean by "create normal compound index"? The patch addresses
a limitation that not every condition can be translated into a proper
scan key. Even if we improve this, there will always be such conditions.
The IOS can evaluate them on the index tuple; the regular index scan
can't do that (currently).

Can you share an example demonstrating the alternative approach?


May be I missed something.

This is the example from 
https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me 
:


```

And here is the plan with index on (a,b).

Limit  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
  Output: a, b, d
  Buffers: shared hit=613
  ->  Index Scan using t_a_b_idx on public.t  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
        Output: a, b, d
        Index Cond: ((t.a > 100) AND (t.b = 4))
        Buffers: shared hit=613
Planning:
  Buffers: shared hit=41
Planning Time: 0.314 ms
Execution Time: 6.910 ms
```



Isn't it an optimal plan for this query?

And cite from self reproducible example https://dbfiddle.uk/iehtq44L :
```
create unique index t_a_include_b on t(a) include (b);
-- I'd expecd index above to behave the same as index below for this query
--create unique index on t(a,b);
```

I agree that it is natural to expect the same result for both indexes, 
so this PR definitely makes sense.
My point is only that a compound index (a,b) is in this case more natural 
and preferable.




Re: index prefetching

2024-01-19 Thread Tomas Vondra



On 1/19/24 16:19, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 5:57 pm, Tomas Vondra wrote:
>> On 1/16/24 21:10, Konstantin Knizhnik wrote:
>>> ...
>>>
 4. I think that performing prefetch at executor level is really great
> idea and so prefetch can be used by all indexes, including custom
> indexes. But prefetch will be efficient only if index can provide fast
> access to next TID (located at the same page). I am not sure that
> it is
> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
> custom AM. I wonder if we should extend AM API to make index make a
> decision weather to perform prefetch of TIDs or not.
 I'm not against having a flag to enable/disable prefetching, but the
 question is whether doing prefetching for such indexes can be harmful.
 I'm not sure about that.
>>> I tend to agree with you - it is hard to imagine an index implementation
>>> which doesn't win from prefetching heap pages.
>>> Maybe only the filtering case you have mentioned. But it seems to me
>>> that current B-Tree index scan (not IOS) implementation in Postgres
>>> doesn't try to use index tuple to check extra condition - it will fetch
>>> heap tuple in any case.
>>>
>> That's true, but that's why I started working on this:
>>
>> https://commitfest.postgresql.org/46/4352/
>>
>> I need to think about how to combine that with the prefetching. The good
>> thing is that both changes require fetching TIDs, not slots. I think the
>> condition can be simply added to the prefetch callback.
>>
>>
>> regards
>>
> Looks like I was wrong: even if it is not an index-only scan, if the index
> condition involves only index attributes then the heap is not accessed
> until we find a tuple satisfying the search condition.
> Inclusive index case described above
> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
> exotic case. If keys are actually used in search, then why not to create
> normal compound index instead?
> 

Not sure I follow ...

Firstly, I'm not convinced the example addressed by that other patch is
that exotic. IMHO it's quite possible it's actually quite common, but
the users do not realize the possible gains.

Also, there are reasons to not want very wide indexes - they have overhead
associated with maintenance, disk space, etc. I think it's perfectly
rational to design indexes in a way that eliminates most heap fetches
necessary to evaluate conditions, but does not guarantee IOS (so the
last heap fetch is still needed).

What do you mean by "create normal compound index"? The patch addresses
a limitation that not every condition can be translated into a proper
scan key. Even if we improve this, there will always be such conditions.
The IOS can evaluate them on the index tuple; the regular index scan
can't do that (currently).

Can you share an example demonstrating the alternative approach?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-19 Thread Melanie Plageman
On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
 wrote:
>
> On 1/9/24 21:31, Robert Haas wrote:
> > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> >  wrote:
> >> Here's a somewhat reworked version of the patch. My initial goal was to
> >> see if it could adopt the StreamingRead API proposed in [1], but that
> >> turned out to be less straight-forward than I hoped, for two reasons:
> >
> > I guess we need Thomas or Andres or maybe Melanie to comment on this.
> >
>
> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
> streaming I/O stuff.

I've been studying your patch with the intent of finding a way to
change it and/or the streaming read API so they work together. I've
attached a very rough sketch of how I think it could work.

We fill a queue with blocks from TIDs that we fetched from the index.
The queue is saved in a scan descriptor that is made available to the
streaming read callback. Once the queue is full, we invoke the table
AM specific index_fetch_tuple() function which calls
pg_streaming_read_buffer_get_next(). When the streaming read API
invokes the callback we registered, it simply dequeues a block number
for prefetching. The only change to the streaming read API is that
now, even if the callback returns InvalidBlockNumber, we may not be
finished, so make it resumable.

Structurally, this changes the timing of when the heap blocks are
prefetched. Your code would get a tid from the index and then prefetch
the heap block -- doing this until it filled a queue that had the
actual tids saved in it. With my approach and the streaming read API,
you fetch tids from the index until you've filled up a queue of block
numbers. Then the streaming read API will prefetch those heap blocks.

I didn't actually implement the block queue -- I just saved a single
block number and pretended it was a block queue. I was imagining we
replace this with something like your IndexPrefetch->blockItems --
which has light deduplication. We'd probably have to flesh it out more
than that.

There are also table AM layering violations in my sketch which would
have to be worked out (not to mention some resource leakage I didn't
bother investigating [which causes it to fail tests]).

0001 is all of Thomas' streaming read API code that isn't yet in
master, and 0002 is my rough sketch of index prefetching using the
streaming read API.

There are also numerous optimizations that your index prefetching
patch set does that would need to be added in some way. I haven't
thought much about it yet. I wanted to see what you thought of this
approach first. Basically, is it workable?

- Melanie
From 31a0b829b3aca31542dc3236b408f1e86133aea7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 19 Jan 2024 16:10:30 -0500
Subject: [PATCH v1 2/2] use streaming reads in index scan

---
 src/backend/access/heap/heapam_handler.c | 14 +++-
 src/backend/access/index/indexam.c   |  2 +
 src/backend/executor/nodeIndexscan.c | 83 
 src/backend/storage/aio/streaming_read.c | 10 ++-
 src/include/access/relscan.h |  6 ++
 src/include/storage/streaming_read.h |  2 +
 6 files changed, 101 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..0ef5f824546 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -127,9 +127,17 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 		/* Switch to correct buffer if we don't have it already */
 		Buffer		prev_buf = hscan->xs_cbuf;
 
-		hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
-											  hscan->xs_base.rel,
-											  ItemPointerGetBlockNumber(tid));
+		if (scan->pgsr)
+		{
+			hscan->xs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (!BufferIsValid(hscan->xs_cbuf))
+				return false;
+		}
+		else
+			hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
+												  hscan->xs_base.rel,
+												  ItemPointerGetBlockNumber(tid));
+
 
 		/*
 		 * Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 63dff101e29..c118cc3861f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -2

Re: index prefetching

2024-01-19 Thread Konstantin Knizhnik


On 18/01/2024 5:57 pm, Tomas Vondra wrote:

On 1/16/24 21:10, Konstantin Knizhnik wrote:

...


4. I think that performing prefetch at the executor level is a really great
idea, as prefetch can then be used by all indexes, including custom
indexes. But prefetch will be efficient only if the index can provide fast
access to the next TID (located on the same page). I am not sure that this
is true for all builtin indexes (GIN, GiST, BRIN, ...) and especially for
custom AMs. I wonder if we should extend the AM API to let the index make a
decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.

I tend to agree with you - it is hard to imagine an index implementation
which doesn't benefit from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me
that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will
fetch the heap tuple in any case.


That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good
thing is that both changes require fetching TIDs, not slots. I think the
condition can be simply added to the prefetch callback.


regards

Looks like I was not right - even if it is not an index-only scan, if the
index condition involves only index attributes, then the heap is not
accessed until we find a tuple satisfying the search condition.
The covering-index case described above
(https://commitfest.postgresql.org/46/4352/) is interesting but IMHO an
exotic case. If the keys are actually used in the search, then why not
create a normal compound index instead?




Re: index prefetching

2024-01-19 Thread Tomas Vondra



On 1/19/24 09:34, Konstantin Knizhnik wrote:
> 
> On 18/01/2024 6:00 pm, Tomas Vondra wrote:
>> On 1/17/24 09:45, Konstantin Knizhnik wrote:
>>> I have integrated your prefetch patch in Neon and it actually works!
>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also
>>> seems to work.
>>>
>> Cool! And do you think this is the right design/way to do this?
> 
> I like the idea of prefetching TIDs in executor.
> 
> But looking through your patch I have some questions:
> 
> 
> 1. Why is it necessary to allocate and store the all_visible flag in the
> data buffer? Why can the caller of IndexPrefetchNext not look at the
> prefetch field?
> 
> +        /* store the all_visible flag in the private part of the entry */
> +        entry->data = palloc(sizeof(bool));
> +        *(bool *) entry->data = all_visible;
> 

What do you mean by "prefetch field"? The reason it's done like this is
to do the VM check only once - without keeping the value, we'd have to
do it in the "next" callback, to determine whether we need to prefetch the
heap tuple, and then again later in the index-only scan itself. That's a
significant overhead, especially in the case when everything is visible.

> 2. Names of the functions `IndexPrefetchNext` and
> `IndexOnlyPrefetchNext` are IMHO confusing, because they look similar and
> one can assume that the former is used for normal index scans and the
> latter for index-only scans. But actually `IndexOnlyPrefetchNext` is a
> callback and `IndexPrefetchNext` is used in both nodeIndexscan.c and
> nodeIndexonlyscan.c.
> 

Yeah, that's a good point. The naming probably needs rethinking.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-19 Thread Konstantin Knizhnik



On 18/01/2024 6:00 pm, Tomas Vondra wrote:

On 1/17/24 09:45, Konstantin Knizhnik wrote:

I have integrated your prefetch patch in Neon and it actually works!
Moreover, I combined it with prefetch of leaf pages for IOS and it also
seems to work.


Cool! And do you think this is the right design/way to do this?


I like the idea of prefetching TIDs in executor.

But looking through your patch I have some questions:


1. Why is it necessary to allocate and store the all_visible flag in the
data buffer? Why can the caller of IndexPrefetchNext not look at the
prefetch field?


+        /* store the all_visible flag in the private part of the entry */
+        entry->data = palloc(sizeof(bool));
+        *(bool *) entry->data = all_visible;

2. Names of the functions `IndexPrefetchNext` and
`IndexOnlyPrefetchNext` are IMHO confusing, because they look similar and
one can assume that the former is used for normal index scans and the
latter for index-only scans. But actually `IndexOnlyPrefetchNext` is a
callback and `IndexPrefetchNext` is used in both nodeIndexscan.c and
nodeIndexonlyscan.c.







Re: index prefetching

2024-01-18 Thread Tomas Vondra
On 1/17/24 09:45, Konstantin Knizhnik wrote:
> I have integrated your prefetch patch in Neon and it actually works!
> Moreover, I combined it with prefetch of leaf pages for IOS and it also
> seems to work.
> 

Cool! And do you think this is the right design/way to do this?

> Just a small notice: you are reporting `blks_prefetch_rounds` in explain,
> but it is not incremented anywhere.
> Moreover, I do not precisely understand what it means, and wonder if such
> information is useful for analyzing a query execution plan.
> Also, your patch always reports the number of prefetched blocks (and
> rounds) if they are not zero.
> 

Right, this needs fixing.

> I think that adding new information to explain may cause some
> problems, because there are a lot of different tools which parse explain
> output to visualize it,
> make recommendations to improve performance, etc. Certainly good
> practice for such tools is to ignore all unknown tags. But I am not sure
> that everybody follows this practice.
> It seems safer, and at the same time convenient for users, to
> add an extra tag to explain to enable/disable prefetch info (as it was
> done in Neon).
> 

I think we want to add this info to explain, but maybe it should be
behind a new flag and disabled by default.

> Here we come back to my custom explain patch;) Actually using it is not
> necessary. You can manually add "prefetch" option to Postgres core (as
> it is currently done in Neon).
> 

Yeah, I think that's the right solution.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-18 Thread Tomas Vondra
On 1/16/24 21:10, Konstantin Knizhnik wrote:
> 
> ...
> 
>> 4. I think that performing prefetch at the executor level is a really great
>>> idea, as prefetch can then be used by all indexes, including custom
>>> indexes. But prefetch will be efficient only if the index can provide fast
>>> access to the next TID (located on the same page). I am not sure that this
>>> is true for all builtin indexes (GIN, GiST, BRIN, ...) and especially for
>>> custom AMs. I wonder if we should extend the AM API to let the index make
>>> a decision whether to perform prefetch of TIDs or not.
>> I'm not against having a flag to enable/disable prefetching, but the
>> question is whether doing prefetching for such indexes can be harmful.
>> I'm not sure about that.
> 
> I tend to agree with you - it is hard to imagine an index implementation
> which doesn't benefit from prefetching heap pages.
> Maybe only the filtering case you have mentioned. But it seems to me
> that the current B-Tree index scan (not IOS) implementation in Postgres
> doesn't try to use the index tuple to check extra conditions - it will
> fetch the heap tuple in any case.
> 

That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good
thing is that both changes require fetching TIDs, not slots. I think the
condition can be simply added to the prefetch callback.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-17 Thread Konstantin Knizhnik

I have integrated your prefetch patch in Neon and it actually works!
Moreover, I combined it with prefetch of leaf pages for IOS and it also 
seems to work.


Just a small notice: you are reporting `blks_prefetch_rounds` in explain,
but it is not incremented anywhere.
Moreover, I do not precisely understand what it means, and wonder if such
information is useful for analyzing a query execution plan.
Also, your patch always reports the number of prefetched blocks (and
rounds) if they are not zero.


I think that adding new information to explain may cause some
problems, because there are a lot of different tools which parse explain
output to visualize it,
make recommendations to improve performance, etc. Certainly good
practice for such tools is to ignore all unknown tags. But I am not sure
that everybody follows this practice.
It seems safer, and at the same time convenient for users, to
add an extra tag to explain to enable/disable prefetch info (as it was done
in Neon).


Here we come back to my custom explain patch;) Actually using it is not 
necessary. You can manually add "prefetch" option to Postgres core (as 
it is currently done in Neon).


Best regards,
Konstantin





Re: index prefetching

2024-01-17 Thread Konstantin Knizhnik



On 16/01/2024 11:58 pm, Jim Nasby wrote:

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
Amazon RDS is just vanilla Postgres with the file system mounted on EBS
(Amazon distributed file system).
EBS provides good throughput but larger latencies compared with
local SSDs.

I am not sure if read-ahead works for EBS.


Actually, EBS only provides a block device - it's definitely not a
filesystem itself (*EFS* is a filesystem - but it's also significantly
different than EBS). So as long as readahead is happening somewhere
above the block device I would expect it to JustWork on EBS.



Thank you for the clarification.
Yes, EBS is just a block device, and read-ahead can be used for it as for
any other local device.
There is actually a recommendation to increase read-ahead for EBS devices
to reach better performance on some workloads:


https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

So it looks like for sequential access patterns, manual prefetching on EBS
is not needed.
But in Neon the situation is quite different. Maybe Aurora Postgres is
using some other mechanism to speed up vacuum and seqscan,

but Neon is using the Postgres prefetch mechanism for it.






Re: index prefetching

2024-01-16 Thread Jim Nasby

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
Amazon RDS is just vanilla Postgres with file system mounted on EBS 
(Amazon  distributed file system).

EBS provides good throughput but larger latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.


Actually, EBS only provides a block device - it's definitely not a
filesystem itself (*EFS* is a filesystem - but it's also significantly
different than EBS). So as long as readahead is happening somewhere
above the block device I would expect it to JustWork on EBS.


Of course, Aurora Postgres (like Neon) is completely different. If you
look at page 53 of [1] you'll note that there are two different terms
used: prefetch and batch. I'm not sure how much practical difference
there is, but batched IO (one IO request to Aurora Storage for many
blocks) predates index prefetch; VACUUM in APG has used batched IO for a
very long time (it also *only* reads blocks that aren't marked all
visible/frozen; none of the "only skip if skipping at least 32 blocks"
logic is used).


1: 
https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf

--
Jim Nasby, Data Architect, Austin TX





Re: index prefetching

2024-01-16 Thread Konstantin Knizhnik


On 16/01/2024 6:25 pm, Tomas Vondra wrote:

On 1/16/24 09:13, Konstantin Knizhnik wrote:

Hi,

On 12/01/2024 6:42 pm, Tomas Vondra wrote:

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I am thinking about testing your patch with Neon (cloud Postgres). As
Neon separates compute and storage, prefetch is much more critical
for the Neon architecture than for vanilla Postgres.

I have a few complaints:

1. It disables prefetch for sequential access patterns (i.e. INDEX
MERGE), on the grounds that in this case OS read-ahead will be more
efficient than prefetch. That may be true for normal storage devices, but
not for Neon storage, and maybe also for Postgres on top of a DFS (i.e.
Amazon RDS). I wonder if we can delegate the decision whether to perform
prefetch in this case to some other level. I do not know precisely where
it should be handled. The best candidate IMHO is the storage manager. But
that most likely requires extension of the SMGR API. Not sure if you want
to do it... A straightforward solution is to move this logic to some
callback, which can be overridden by the user.


Interesting point. You're right these decisions (whether to prefetch
particular patterns) are closely tied to the capabilities of the storage
system. So it might make sense to maybe define it at that level.

Not sure what exactly RDS does with the storage - my understanding is
that it's mostly regular Postgres code, but managed by Amazon. So how
would that modify the prefetching logic?


Amazon RDS is just vanilla Postgres with the file system mounted on EBS
(Amazon distributed file system).

EBS provides good throughput but larger latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.




4. I think that performing prefetch at the executor level is a really great
idea, as prefetch can then be used by all indexes, including custom
indexes. But prefetch will be efficient only if the index can provide fast
access to the next TID (located on the same page). I am not sure that this
is true for all builtin indexes (GIN, GiST, BRIN, ...) and especially for
custom AMs. I wonder if we should extend the AM API to let the index make a
decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.


I tend to agree with you - it is hard to imagine an index implementation
which doesn't benefit from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me
that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will
fetch the heap tuple in any case.



5. Minor notice: there are a few places where index_getnext_slot is called
with the last parameter NULL (disabled prefetch) with the following comment
"XXX Would be nice to also benefit from prefetching here." But all these
places correspond to "point lookup", i.e. unique constraint checks,
finding a replication tuple by index... Prefetch seems unlikely to be
useful here, unless the index is bloated and we have to skip a lot of
tuples before locating the right one. But should we try to optimize the
case of bloated indexes?


Are you sure you're looking at the last patch version? Because the
current patch does not have any new parameters in index_getnext_* and
the comments were removed too (I suppose you're talking about
execIndexing, execReplication and those places).

Sorry, I looked at v20240103-0001-prefetch-2023-12-09.patch; I didn't
notice v20240112-0001-Prefetch-heap-pages-during-index-scans.patch




regards


Re: index prefetching

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 11:25 AM Tomas Vondra
 wrote:
> > 3. It doesn't perform prefetch of leaf pages for IOS, only referenced
> > heap pages which are not marked as all-visible. It seems to me that if
> > the optimizer has chosen IOS (and not bitmap heap scan, for example),
> > then there should be a large enough fraction of all-visible pages. Also,
> > index prefetch is most efficient for OLAP queries, and those are
> > typically run against historical data which is all-visible. But IOS can
> > be handled separately in some other PR. Frankly speaking, combining
> > prefetch of leaf B-Tree pages and referenced heap pages seems to be a
> > very challenging task.
>
> I see prefetching of leaf pages as interesting / worthwhile improvement,
> but out of scope for this patch. I don't think it can be done at the
> executor level - the prefetch requests need to be submitted from the
> index AM code (by calling PrefetchBuffer, etc.)

+1. This is a good feature, and so is that, but they're not the same
feature, despite the naming problems.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-16 Thread Tomas Vondra
On 1/16/24 09:13, Konstantin Knizhnik wrote:
> Hi,
> 
> On 12/01/2024 6:42 pm, Tomas Vondra wrote:
>> Hi,
>>
>> Here's an improved version of this patch, finishing a lot of the stuff
>> that I alluded to earlier - moving the code from indexam.c, renaming a
>> bunch of stuff, etc. I've also squashed it into a single patch, to make
>> it easier to review.
> 
> I am thinking about testing your patch with Neon (cloud Postgres). As
> Neon separates compute and storage, prefetch is much more critical
> for the Neon architecture than for vanilla Postgres.
> 
> I have a few complaints:
> 
> 1. It disables prefetch for sequential access patterns (i.e. INDEX
> MERGE), on the grounds that in this case OS read-ahead will be more
> efficient than prefetch. That may be true for normal storage devices, but
> not for Neon storage, and maybe also for Postgres on top of a DFS (i.e.
> Amazon RDS). I wonder if we can delegate the decision whether to perform
> prefetch in this case to some other level. I do not know precisely where
> it should be handled. The best candidate IMHO is the storage manager. But
> that most likely requires extension of the SMGR API. Not sure if you want
> to do it... A straightforward solution is to move this logic to some
> callback, which can be overridden by the user.
> 

Interesting point. You're right these decisions (whether to prefetch
particular patterns) are closely tied to the capabilities of the storage
system. So it might make sense to maybe define it at that level.

Not sure what exactly RDS does with the storage - my understanding is
that it's mostly regular Postgres code, but managed by Amazon. So how
would that modify the prefetching logic?

However, I'm not against making this modular / wrapping this in some
sort of callbacks, for example.

> 2. It disables prefetch for direct_io. It seems to be even more obvious
> than 1), because prefetching using `posix_fadvise` is definitely not
> possible when using direct_io. But in theory, if SMGR provides some
> alternative prefetch implementation (as in the case of Neon), this may
> also not be true. Still unclear why we would want to use direct_io in
> Neon... But still, I prefer to move this decision outside the executor.
> 

True. I think this would / should be customizable by the callback.

> 3. It doesn't perform prefetch of leaf pages for IOS, only referenced
> heap pages which are not marked as all-visible. It seems to me that if
> the optimizer has chosen IOS (and not bitmap heap scan, for example),
> then there should be a large enough fraction of all-visible pages. Also,
> index prefetch is most efficient for OLAP queries, and those are
> typically run against historical data which is all-visible. But IOS can
> be handled separately in some other PR. Frankly speaking, combining
> prefetch of leaf B-Tree pages and referenced heap pages seems to be a
> very challenging task.
> 

I see prefetching of leaf pages as interesting / worthwhile improvement,
but out of scope for this patch. I don't think it can be done at the
executor level - the prefetch requests need to be submitted from the
index AM code (by calling PrefetchBuffer, etc.)

> 4. I think that performing prefetch at the executor level is a really
> great idea, as prefetch can then be used by all indexes, including custom
> indexes. But prefetch will be efficient only if the index can provide
> fast access to the next TID (located on the same page). I am not sure
> that this is true for all builtin indexes (GIN, GiST, BRIN, ...) and
> especially for custom AMs. I wonder if we should extend the AM API to let
> the index make a decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.

> 
> 5. Minor notice: there are a few places where index_getnext_slot is
> called with the last parameter NULL (disabled prefetch) with the
> following comment
> "XXX Would be nice to also benefit from prefetching here." But all these
> places correspond to "point lookup", i.e. unique constraint checks,
> finding a replication tuple by index... Prefetch seems unlikely to be
> useful here, unless the index is bloated and we have to skip a lot of
> tuples before locating the right one. But should we try to optimize the
> case of bloated indexes?
> 

Are you sure you're looking at the last patch version? Because the
current patch does not have any new parameters in index_getnext_* and
the comments were removed too (I suppose you're talking about
execIndexing, execReplication and those places).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-16 Thread Konstantin Knizhnik

Hi,

On 12/01/2024 6:42 pm, Tomas Vondra wrote:

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.


I am thinking about testing your patch with Neon (cloud Postgres). As
Neon separates compute and storage, prefetch is much more critical
for the Neon architecture than for vanilla Postgres.

I have a few complaints:

1. It disables prefetch for sequential access patterns (i.e. INDEX
MERGE), on the grounds that in this case OS read-ahead will be more
efficient than prefetch. That may be true for normal storage devices, but
not for Neon storage, and maybe also for Postgres on top of a DFS (i.e.
Amazon RDS). I wonder if we can delegate the decision whether to perform
prefetch in this case to some other level. I do not know precisely where
it should be handled. The best candidate IMHO is the storage manager. But
that most likely requires extension of the SMGR API. Not sure if you want
to do it... A straightforward solution is to move this logic to some
callback, which can be overridden by the user.


2. It disables prefetch for direct_io. It seems to be even more obvious
than 1), because prefetching using `posix_fadvise` is definitely not
possible when using direct_io. But in theory, if SMGR provides some
alternative prefetch implementation (as in the case of Neon), this may
also not be true. Still unclear why we would want to use direct_io in
Neon... But still, I prefer to move this decision outside the executor.


3. It doesn't perform prefetch of leaf pages for IOS, only referenced
heap pages which are not marked as all-visible. It seems to me that if
the optimizer has chosen IOS (and not bitmap heap scan, for example),
then there should be a large enough fraction of all-visible pages. Also,
index prefetch is most efficient for OLAP queries, and those are typically
run against historical data which is all-visible. But IOS can be
handled separately in some other PR. Frankly speaking, combining
prefetch of leaf B-Tree pages and referenced heap pages seems to be a
very challenging task.


4. I think that performing prefetch at the executor level is a really great
idea, as prefetch can then be used by all indexes, including custom
indexes. But prefetch will be efficient only if the index can provide fast
access to the next TID (located on the same page). I am not sure that this
is true for all builtin indexes (GIN, GiST, BRIN, ...) and especially for
custom AMs. I wonder if we should extend the AM API to let the index make a
decision whether to perform prefetch of TIDs or not.


5. Minor notice: there are a few places where index_getnext_slot is called
with the last parameter NULL (disabled prefetch) with the following comment
"XXX Would be nice to also benefit from prefetching here." But all these
places correspond to "point lookup", i.e. unique constraint checks,
finding a replication tuple by index... Prefetch seems unlikely to be
useful here, unless the index is bloated and we have to skip a lot of
tuples before locating the right one. But should we try to optimize the
case of bloated indexes?


Re: index prefetching

2024-01-12 Thread Robert Haas
Not a full response, but just to address a few points:

On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra
 wrote:
> Thinking about this, I think it should be possible to make prefetching
> work even for plans with execute_once=false. In particular, when the
> plan changes direction it should be possible to simply "walk back" the
> prefetch queue, to get to the "correct" place in the scan. But I'm
> not sure it's worth it, because plans that change direction often can't
> really benefit from prefetches anyway - they'll often visit stuff they
> accessed shortly before anyway. For plans that don't change direction
> but may pause, we don't know if the plan pauses long enough for the
> prefetched pages to get evicted or something. So I think it's OK that
> execute_once=false means no prefetching.

+1.

> > + * XXX We do add the cache size to the request in order not to
> > + * have issues with uint64 underflows.
> >
> > I don't know what this means.
> >
>
> There's a check that does this:
>
>   (x + PREFETCH_CACHE_SIZE) >= y
>
> it might also be done as "mathematically equivalent"
>
>   x >= (y - PREFETCH_CACHE_SIZE)
>
> but if the "y" is an uint64, and the value is smaller than the constant,
> this would underflow. It'd eventually disappear, once the "y" gets large
> enough, ofc.

The problem is, I think, that there's no particular reason that
someone reading the existing code should imagine that it might have
been done in that "mathematically equivalent" fashion. I imagined that
you were trying to make a point about adding the cache size to the
request vs. adding nothing, whereas in reality you were trying to make
a point about adding from one side vs. subtracting from the other.
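For the archives, the underflow is easy to demonstrate. A minimal sketch in C (the predicate names are made up for illustration; PREFETCH_CACHE_SIZE = 8 * 128 = 1024 matches the patch's constants):

```c
#include <stdint.h>

#define PREFETCH_CACHE_SIZE 1024    /* PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT */

/* The form the patch uses: adding the constant on the left cannot wrap,
 * because both operands are monotonically growing counters. */
static int
recently_prefetched_safe(uint64_t request, uint64_t current)
{
    return (request + PREFETCH_CACHE_SIZE) >= current;
}

/* The "mathematically equivalent" form: when current < PREFETCH_CACHE_SIZE,
 * the subtraction underflows to a huge uint64 and the check wrongly fails. */
static int
recently_prefetched_unsafe(uint64_t request, uint64_t current)
{
    return request >= (current - PREFETCH_CACHE_SIZE);
}
```

Early in a scan (say request = 5, current = 10) the two forms disagree; once current grows past the constant they agree again, which is what makes the bug easy to miss.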

> > + * We reach here if the index only scan is not parallel, or if we're
> > + * serially executing an index only scan that was planned to be
> > + * parallel.
> >
> > Well, this seems sad.
>
> Stale comment, I believe. However, I didn't see much benefit with
> parallel index scans during testing. Having I/O from multiple workers
> generally had the same effect, I think.

Fair point, likely worth mentioning explicitly in the comment.

> Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
> to keep it consistent. And then the constants are all capital, ofc.

It'd still be nice to get table or heap in there, IMHO, but maybe we
can't, and consistency is certainly a good thing regardless of the
details, so thanks for that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-12 Thread Tomas Vondra
tching, after reading the next
> + * index page (or rather after rescan)?
> 
> It seems questionable to use plan_rows here because (1) I don't think
> we have existing cases where we use the estimated row count in the
> executor for anything, we just carry it through so EXPLAIN can print
> it and (2) row count estimates can be really far off, especially if
> we're on the inner side of a nested loop, we might like to figure that
> out eventually instead of just DTWT forever. But on the other hand
> this does feel like an important case where we have a clue that
> prefetching might need to be done less aggressively or not at all, and
> it doesn't seem right to ignore that signal either. I wonder if we
> want this shaped in some other way, like a Boolean that says
> are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
> side of a semi-join or anti-join.
> 

The current code actually does look at plan_rows when calculating the
prefetch target:

  prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
node->ss.ps.plan->plan_rows,
estate->es_use_prefetching);

but I agree maybe it should not, for the reasons you explain. I'm not
attached to this part.


> + * We reach here if the index only scan is not parallel, or if we're
> + * serially executing an index only scan that was planned to be
> + * parallel.
> 
> Well, this seems sad.
> 

Stale comment, I believe. However, I didn't see much benefit with
parallel index scans during testing. Having I/O from multiple workers
generally had the same effect, I think.

> + * XXX This might lead to IOS being slower than plain index scan, if the
> + * table has a lot of pages that need recheck.
> 
> How?
> 

The comment is not particularly clear what "this" means, but I believe
this was about index-only scan with many not-all-visible pages. If it
didn't do prefetching, a regular index scan with prefetching may be way
faster. But the code actually allows doing prefetching even for IOS, by
checking the vm in the "next" callback.

> +/*
> + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
> + * of a misuse of the flag, but we need to disable prefetching for cursors
> + * (which might change direction), and parallelModeOK does that. But maybe
> + * we might (or should) have a separate flag.
> + */
> 
> I think the correct flag to be using here is execute_once, which
> captures whether the executor could potentially be invoked a second
> time for the same portal. Changes in the fetch direction are possible
> if and only if !execute_once.
> 

Right. The new patch version does that.

>> Note 1: The IndexPrefetch name is a bit misleading, because it's used
>> even with prefetching disabled - all index reads from the index scan
>> happen through it. Maybe it should be called IndexReader or something
>> like that.
> 
> My biggest gripe here is the capitalization. This version adds, inter
> alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
> index_heap_prefetch_target, which seems like one or two too many
> conventions. But maybe the PREFETCH_* macros don't even belong in a
> public header.
> 
> I do like the index_heap_prefetch_* naming. Possibly that's too
> verbose to use for everything, but calling this index-heap-prefetch
> rather than index-prefetch seems clearer.
> 

Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
to keep it consistent. And then the constants are all capital, ofc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From a3f99cc0aaa64ef94b09fc0a58bee709cd29add9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 17 Nov 2023 23:54:19 +0100
Subject: [PATCH v20240112] Prefetch heap pages during index scans

Index scans are a significant source of random I/O on the indexed heap,
but can't benefit from kernel read-ahead. For bitmap scans that is not
an issue, because they do prefetch explicitly, but for plain index scans
this is a major bottleneck - reading page at a time does not allow
saturating modern storage systems.

This enhances index scans (including index-only scans) to prefetch heap
pages. The scan maintains a queue of future TIDs received from the index,
prefetches the associated heap pages, and then eventually passes each TID
to the caller.

To eliminate unnecessary prefetches, a small cache of recently prefetched
blocks is maintained, and repeated prefetches are skipped. Furthermore,
sequential access patterns are detected and not prefetched, on the
assumption that the kernel read-ahead will handle them more efficiently.

These optimizations are best-effort heuristics - we don't know if 

Re: index prefetching

2024-01-09 Thread Robert Haas
e during prefetching, after reading the next
+ * index page (or rather after rescan)?

It seems questionable to use plan_rows here because (1) I don't think
we have existing cases where we use the estimated row count in the
executor for anything, we just carry it through so EXPLAIN can print
it and (2) row count estimates can be really far off, especially if
we're on the inner side of a nested loop, we might like to figure that
out eventually instead of just DTWT forever. But on the other hand
this does feel like an important case where we have a clue that
prefetching might need to be done less aggressively or not at all, and
it doesn't seem right to ignore that signal either. I wonder if we
want this shaped in some other way, like a Boolean that says
are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
side of a semi-join or anti-join.

+ * We reach here if the index only scan is not parallel, or if we're
+ * serially executing an index only scan that was planned to be
+ * parallel.

Well, this seems sad.

+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.

How?

+    /*
+ * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
+ * of a misuse of the flag, but we need to disable prefetching for cursors
+ * (which might change direction), and parallelModeOK does that. But maybe
+ * we might (or should) have a separate flag.
+ */

I think the correct flag to be using here is execute_once, which
captures whether the executor could potentially be invoked a second
time for the same portal. Changes in the fetch direction are possible
if and only if !execute_once.

> Note 1: The IndexPrefetch name is a bit misleading, because it's used
> even with prefetching disabled - all index reads from the index scan
> happen through it. Maybe it should be called IndexReader or something
> like that.

My biggest gripe here is the capitalization. This version adds, inter
alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
index_heap_prefetch_target, which seems like one or two too many
conventions. But maybe the PREFETCH_* macros don't even belong in a
public header.

I do like the index_heap_prefetch_* naming. Possibly that's too
verbose to use for everything, but calling this index-heap-prefetch
rather than index-prefetch seems clearer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-04 Thread Tomas Vondra
Hi,

Here's a somewhat reworked version of the patch. My initial goal was to
see if it could adopt the StreamingRead API proposed in [1], but that
turned out to be less straightforward than I hoped, for two reasons:

(1) The StreamingRead API seems to be designed for pages, but the index
code naturally works with TIDs/tuples. Yes, the callbacks can associate
the blocks with custom data (in this case that'd be the TID), but it
seemed a bit strange ...

(2) The place adding requests to the StreamingRead queue is pretty far
from the place actually reading the pages - for prefetching, the
requests would be generated in nodeIndexscan, but the page reading
happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple.
Sure, the TIDs would come from a callback, so it's a bit as if the
requests were generated in heapam_index_fetch_tuple - but that function
has no idea StreamingRead exists, so where would it get the queue from?

We might teach it about that, but what if there are multiple places
calling index_fetch_heap(), not all of which use StreamingRead (only
index scans would)? And if there are multiple index scans, there'd need
to be a separate StreamingRead queue for each, right?

In any case, I felt a bit out of my depth here, and I chose not to do
all this work without discussing the direction here. (Also, see the
point about cursors and xs_heap_continue a bit later in this post.)


I did however like the general StreamingRead API - how it splits the
work between the API and the callback. The patch used to do everything,
which meant it hardcoded a lot of the IOS-specific logic etc. I did plan
to have some sort of "callback" for reading from the queue, but that
didn't quite solve this issue - a lot of the stuff remained hard-coded.
But the StreamingRead API made me realize that having a callback for the
first phase (that adds requests to the queue) would fix that.

So I did that - there's now one simple callback in for index scans, and
a bit more complex callback for index-only scans. Thanks to this the
hard-coded stuff mostly disappears, which is good.

Perhaps a bigger change is that I decided to move this into a separate
API on top of indexam.c. The original idea was to integrate this into
index_getnext_tid/index_getnext_slot, so that all callers benefit from
the prefetching automatically. Which would be nice, but it also meant
it'd need to happen in the indexam.c code, which seemed dirty.

This patch introduces an API similar to StreamingRead. It calls the
indexam.c stuff, but does all the prefetching on top of it, not in it.
If a place calling index_getnext_tid() wants to allow prefetching, it
needs to switch to IndexPrefetchNext(). (There's no function that would
replace index_getnext_slot, at the moment. Maybe there should be.)

Note 1: The IndexPrefetch name is a bit misleading, because it's used
even with prefetching disabled - all index reads from the index scan
happen through it. Maybe it should be called IndexReader or something
like that.

Note 2: I left the code in indexam.c for now, but in principle it could
(should) be moved to a different place.

I think this layering makes sense, and it's probably much closer to what
Andres meant when he said the prefetching should happen in the executor.
Even if the patch ends up using StreamingRead in the future, I guess
we'll want something like IndexPrefetch - it might use the StreamingRead
internally, but it would still need to do some custom stuff to detect
I/O patterns or something that does not quite fit into the StreamingRead.


Now, let's talk about two (mostly unrelated) problems I ran into.

Firstly, I realized there's a bit of a problem with cursors. The
prefetching works like this:

1) reading TIDs from the index
2) stashing them into a queue in IndexPrefetch
3) doing prefetches for the new TIDs added to the queue
4) returning the TIDs to the caller, one by one

And all of this works ... unless the direction of the scan changes.
Which for cursors can happen if someone does FETCH BACKWARD or stuff
like that. I'm not sure how difficult it'd be to make this work. I
suppose we could simply discard the prefetched entries and do the right
number of steps back for the index scan. But I haven't tried, and maybe
it's more complex than I'm imagining. Also, if the cursor changes the
direction a lot, it'd make the prefetching harmful.

The patch simply disables prefetching for such queries, using the same
logic that we do for parallelism. This may be over-zealous.
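For illustration, steps 1-4 can be modeled as a small ring-buffer queue that stays a fixed distance ahead of the consumer. This is a simplified sketch, not the patch's code - the TID type, the callback shapes, and the queue sizing are all assumptions made here:

```c
#include <stdint.h>

#define QUEUE_SIZE 64           /* assumed capacity, must exceed the distance */

typedef uint64_t TID;           /* stand-in for ItemPointerData */

typedef struct PrefetchQueue
{
    TID         items[QUEUE_SIZE];
    uint64_t    nexti;          /* next slot to fill from the index */
    uint64_t    readi;          /* next slot to hand to the caller */
    int         distance;       /* how far ahead of the caller to stay */
} PrefetchQueue;

typedef int (*next_tid_cb) (void *arg, TID *tid);   /* step 1: read from index */
typedef void (*prefetch_cb) (TID tid);              /* step 3: prefetch heap page */

/* Return the next TID to the caller (step 4), keeping the queue topped up
 * and prefetched `distance` entries ahead (steps 2 and 3). */
static int
prefetch_queue_next(PrefetchQueue *q, next_tid_cb next, prefetch_cb pf,
                    void *arg, TID *tid)
{
    while (q->nexti - q->readi < (uint64_t) q->distance)
    {
        TID     t;

        if (!next(arg, &t))
            break;              /* index exhausted */
        q->items[q->nexti++ % QUEUE_SIZE] = t;
        pf(t);                  /* issue the prefetch for the new TID */
    }

    if (q->readi == q->nexti)
        return 0;
    *tid = q->items[q->readi++ % QUEUE_SIZE];
    return 1;
}

/* Tiny driver used below: an "index" returning 0..src_limit-1 in order,
 * and a prefetch counter. */
static TID  src_next = 0;
static TID  src_limit = 0;
static int  prefetched = 0;

static int
src(void *arg, TID *tid)
{
    (void) arg;
    if (src_next >= src_limit)
        return 0;
    *tid = src_next++;
    return 1;
}

static void
pf_count(TID tid)
{
    (void) tid;
    prefetched++;
}
```

In this model, a FETCH BACKWARD would have to discard the queued (already prefetched) entries and step the index scan back by the same amount - which is exactly the complication described above.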

FWIW this is one of the things that probably should remain outside of
StreamingRead API - it seems pretty index-specific, and I'm not sure
we'd even want to support these "backward" movements in the API.


The other issue I'm aware of is handling xs_heap_continue. I believe it
works fine for "false" but I need to take a look at non-MVCC snapshots
(i.e. when xs_heap_continue=true).


I haven't done any benchmarks with this reworked API - there's a couple
more allocations etc. but it did not change in a 

Re: index prefetching

2023-12-21 Thread Tomas Vondra
On 12/21/23 18:14, Robert Haas wrote:
> On Thu, Dec 21, 2023 at 11:08 AM Andres Freund  wrote:
>> But I'd like you to feel guilty (no, not really) and fix it (yes, really) :)
> 
> Sadly, you're more likely to get the first one than you are to get the
> second one. I can't really see going back to revisit that decision as
> a basis for somebody else's new work -- it'd be better if the person
> doing the new work figured out what makes sense here.
> 

I think it's a great example of "hindsight is 20/20". There were
perfectly valid reasons to have two separate nodes, and it's not like
these reasons somehow disappeared. It still is a perfectly reasonable
decision.

It's just that allowing index-only filters for regular index scans seems
to eliminate pretty much all executor differences between the two nodes.
But that's hard to predict - I certainly would not even have thought
about that back when index-only scans were added.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2023-12-21 Thread Robert Haas
On Thu, Dec 21, 2023 at 11:08 AM Andres Freund  wrote:
> But I'd like you to feel guilty (no, not really) and fix it (yes, really) :)

Sadly, you're more likely to get the first one than you are to get the
second one. I can't really see going back to revisit that decision as
a basis for somebody else's new work -- it'd be better if the person
doing the new work figured out what makes sense here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2023-12-21 Thread Andres Freund
Hi,

On 2023-12-21 11:00:34 -0500, Robert Haas wrote:
> On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra
>  wrote:
> > > I continue to think that we should not have split plain and index only 
> > > scans
> > > into separate files...
> >
> > I do agree with that opinion. Not just because of this prefetching
> > thread, but also because of the discussions about index-only filters in
> > a nearby thread.
> 
> For the record, in the original patch I submitted for this feature, it
> wasn't in separate files. If memory serves, Tom changed it.
> 
> So don't blame me. :-)

But I'd like you to feel guilty (no, not really) and fix it (yes, really) :)

Greetings,

Andres Freund




Re: index prefetching

2023-12-21 Thread Robert Haas
On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra
 wrote:
> > I continue to think that we should not have split plain and index only scans
> > into separate files...
>
> I do agree with that opinion. Not just because of this prefetching
> thread, but also because of the discussions about index-only filters in
> a nearby thread.

For the record, in the original patch I submitted for this feature, it
wasn't in separate files. If memory serves, Tom changed it.

So don't blame me. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2023-12-21 Thread Andres Freund
Hi,

On 2023-12-21 16:20:45 +0100, Tomas Vondra wrote:
> On 12/21/23 14:43, Andres Freund wrote:
> >> AFAICS this seems similar to some of the AIO patch, I wonder what that
> >> plans to do. I need to check.
> > 
> > Yes, most of this exists there.  The difference is that with AIO you don't
> > need to prefetch, as you can just initiate the IO for real, and wait
> > for it to complete.
> > 
> 
> Right, although the line where things stop being "prefetch" and become
> "async" seems a bit unclear to me - perhaps it's more a matter of perspective.

Agreed. What I meant with not needing prefetching was that you'd not use
fadvise(), because it's better to instead just asynchronously read data into
shared buffers. That way you don't have the doubling of syscalls, and you
need to care less about the buffering rate in the kernel.

Greetings,

Andres Freund




Re: index prefetching

2023-12-21 Thread Tomas Vondra



On 12/21/23 14:27, Andres Freund wrote:
> Hi,
> 
> On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote:
>> But there's a layering problem that I don't know how to solve - I don't
>> see how we could make indexam.c entirely oblivious to the prefetching,
>> and move it entirely to the executor. Because how else would you know
>> what to prefetch?
> 
>> With index_getnext_tid() I can imagine fetching TIDs ahead, stashing
>> them into a queue, and prefetching based on that. That's kinda what the
>> patch does, except that it does it from inside index_getnext_tid(). But
>> that does not work for index_getnext_slot(), because that already reads
>> the heap tuples.
> 
>> We could say prefetching only works for index_getnext_tid(), but that
>> seems a bit weird because that's what regular index scans do. (There's a
>> patch to evaluate filters on index, which switches index scans to
>> index_getnext_tid(), so that'd make prefetching work too, but I'd ignore
>> that here.)
> 
> I think we should just switch plain index scans to index_getnext_tid(). It's
> one of the primary places triggering index scans, so a few additional lines
> don't seem problematic.
> 
> I continue to think that we should not have split plain and index only scans
> into separate files...
> 

I do agree with that opinion. Not just because of this prefetching
thread, but also because of the discussions about index-only filters in
a nearby thread.

> 
>> There are other index_getnext_slot() callers, and I don't
>> think we should accept that prefetching does not work for those places
>> (e.g. execIndexing/execReplication would benefit from prefetching, I think).
> 
> I don't think it'd be a problem to have to opt into supporting
> prefetching. There's plenty places where it doesn't really seem likely to be
> useful, e.g. doing prefetching during syscache lookups is very likely just a
> waste of time.
> 
> I don't think e.g. execReplication is likely to benefit from prefetching -
> you're just fetching a single row after all. You'd need a lot of dead rows to
> make it beneficial.  I think it's similar in execIndexing.c.
> 

Yeah, systable scans are unlikely to benefit from prefetching of this
type. I'm not sure about execIndexing/execReplication, it wasn't clear
to me but maybe you're right.

> 
> I suspect we should work on providing executor nodes with some estimates about
> the number of rows that are likely to be consumed. If an index scan is under a
> LIMIT 1, we shouldn't prefetch. Similar for sequential scan with the
> infrastructure in
> https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
> 

Isn't this mostly addressed by the incremental ramp-up at the beginning?
Even with target set to 1000, we only start prefetching 1, 2, 3, ...
blocks ahead, it's not like we'll prefetch 1000 blocks right away.
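As a sketch of that ramp-up (a hypothetical helper; the exact growth rule in the patch may differ):

```c
/* Prefetch distance after `tuples_returned` tuples: starts at 1, grows by
 * one per tuple, and is capped at the configured target. */
static int
prefetch_distance(long tuples_returned, int target)
{
    long        distance = 1 + tuples_returned;

    return (distance < (long) target) ? (int) distance : target;
}
```

Under a LIMIT 1 only a single block ever gets prefetched this way, even with a target of 1000 - which is why the ramp-up already softens the LIMIT problem.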

I did initially plan to also consider the number of rows we're expected
to need, but I think it's actually harder than it might seem. With LIMIT
for example we often don't know how selective the qual is, it's not like
we can just stop prefetching after the reading the first N tids. With
other nodes it's good to remember those are just estimates - it'd be
silly to be bitten both by a wrong estimate and also prefetching doing
the wrong thing based on an estimate.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2023-12-21 Thread Tomas Vondra



On 12/21/23 14:43, Andres Freund wrote:
> Hi,
> 
> On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote:
>> You're right a lot of this is a guesswork. I don't think we can do much
>> better, because it depends on stuff that's out of our control - each OS
>> may do things differently, or perhaps it's just configured differently.
>>
>> But I don't think this is really a serious issue - all the read-ahead
>> implementations need to work about the same, because they are meant to
>> work in a transparent way.
>>
>> So it's about deciding at which point we think this is a sequential
>> pattern. Yes, the OS may use a slightly different threshold, but the
>> exact value does not really matter - in the worst case we prefetch a
>> couple more/fewer blocks.
>>
>> The OS read-ahead can't really prefetch anything except sequential
>> cases, so the whole question is "When does the access pattern get
>> sequential enough?". I don't think there's a perfect answer, and I don't
>> think we need a perfect one - we just need to be reasonably close.
> 
> For the streaming read interface (initially backed by fadvise, to then be
> replaced by AIO) we found that it's clearly necessary to avoid fadvises in
> cases of actual sequential IO - the overhead otherwise leads to easily
> reproducible regressions.  So I don't think we have much choice.
> 

Yeah, the regression are pretty easy to demonstrate. In fact, I didn't
have such detection in the first patch, but after the first round of
benchmarks it became obvious it's needed.

> 
>> Also, while I don't want to lazily dismiss valid cases that might be
>> affected by this, I think that sequential access for index paths is not
>> that common (with the exception of clustered indexes).
> 
> I think sequential access is common in other cases as well. There's lots of
> indexes where heap tids are almost perfectly correlated with index entries,
> consider insert-only tables with serial PKs or inserted_at
> timestamp columns.  Even leaving those aside, for indexes with many entries
> for the same key, we sort by tid these days, which will also result in
> "runs" of sequential access.
> 

True. I should have thought about those cases.

> 
>> Obviously, the latter case has much more severe impact, but it depends
>> on the exact workload / access pattern etc. The only "perfect" solution
>> would be to actually check the page cache, but well - that seems to be
>> fairly expensive.
> 
>> What I was envisioning was something self-tuning, based on the I/O we
>> may do later. If the prefetcher decides to prefetch something, but finds
>> it's already in cache, we'd increase the distance, to remember more
>> blocks. Likewise, if a block is not prefetched but then requires I/O
>> later, decrease the distance. That'd make it adaptive, but I don't think
>> we actually have the info about I/O.
> 
> How would the prefetcher know that the data wasn't in cache?
> 

I don't think there's a good way to do that, unfortunately, or at least
I'm not aware of it. That's what I meant by "we don't have the info" at
the end. Which is why I haven't tried implementing it.

The only "solution" I could come up with was some sort of "timing" for
the I/O requests and deducing what was cached. Not great, of course.

> 
>> Alternatively, I was thinking about moving the prefetches into a
>> separate worker process (or multiple workers), so we'd just queue the
>> request and all the overhead would be done by the worker. The main
>> problem is the overhead of calling posix_fadvise() for blocks that are
>> already in memory, and this would just move it to a separate backend. I
>> wonder if that might even make the custom cache unnecessary / optional.
> 
> The AIO patchset provides this.
> 

OK, I guess it's time for me to take a look at the patch again.

> 
>> AFAICS this seems similar to some of the AIO patch, I wonder what that
>> plans to do. I need to check.
> 
> Yes, most of this exists there.  The difference is that with AIO you don't
> need to prefetch, as you can just initiate the IO for real, and wait for it to
> complete.
> 

Right, although the line where things stop being "prefetch" and become
"async" seems a bit unclear to me - perhaps it's more a matter of perspective.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2023-12-21 Thread Andres Freund
Hi,

On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote:
> You're right a lot of this is a guesswork. I don't think we can do much
> better, because it depends on stuff that's out of our control - each OS
> may do things differently, or perhaps it's just configured differently.
> 
> But I don't think this is really a serious issue - all the read-ahead
> implementations need to work about the same, because they are meant to
> work in a transparent way.
> 
> So it's about deciding at which point we think this is a sequential
> pattern. Yes, the OS may use a slightly different threshold, but the
> exact value does not really matter - in the worst case we prefetch a
> couple more/fewer blocks.
> 
> The OS read-ahead can't really prefetch anything except sequential
> cases, so the whole question is "When does the access pattern get
> sequential enough?". I don't think there's a perfect answer, and I don't
> think we need a perfect one - we just need to be reasonably close.

For the streaming read interface (initially backed by fadvise, to then be
replaced by AIO) we found that it's clearly necessary to avoid fadvises in
cases of actual sequential IO - the overhead otherwise leads to easily
reproducible regressions.  So I don't think we have much choice.


> Also, while I don't want to lazily dismiss valid cases that might be
> affected by this, I think that sequential access for index paths is not
> that common (with the exception of clustered indexes).

I think sequential access is common in other cases as well. There's lots of
indexes where heap tids are almost perfectly correlated with index entries,
consider insert-only tables with serial PKs or inserted_at
timestamp columns.  Even leaving those aside, for indexes with many entries
for the same key, we sort by tid these days, which will also result in
"runs" of sequential access.


> Obviously, the latter case has much more severe impact, but it depends
> on the exact workload / access pattern etc. The only "perfect" solution
> would be to actually check the page cache, but well - that seems to be
> fairly expensive.

> What I was envisioning was something self-tuning, based on the I/O we
> may do later. If the prefetcher decides to prefetch something, but finds
> it's already in cache, we'd increase the distance, to remember more
> blocks. Likewise, if a block is not prefetched but then requires I/O
> later, decrease the distance. That'd make it adaptive, but I don't think
> we actually have the info about I/O.

How would the prefetcher know that the data wasn't in cache?


> Alternatively, I was thinking about moving the prefetches into a
> separate worker process (or multiple workers), so we'd just queue the
> request and all the overhead would be done by the worker. The main
> problem is the overhead of calling posix_fadvise() for blocks that are
> already in memory, and this would just move it to a separate backend. I
> wonder if that might even make the custom cache unnecessary / optional.

The AIO patchset provides this.


> AFAICS this seems similar to some of the AIO patch, I wonder what that
> plans to do. I need to check.

Yes, most of this exists there.  The difference is that with AIO you don't
need to prefetch, as you can just initiate the IO for real, and wait for it to
complete.

Greetings,

Andres Freund




Re: index prefetching

2023-12-21 Thread Andres Freund
Hi,

On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote:
> But there's a layering problem that I don't know how to solve - I don't
> see how we could make indexam.c entirely oblivious to the prefetching,
> and move it entirely to the executor. Because how else would you know
> what to prefetch?

> With index_getnext_tid() I can imagine fetching TIDs ahead, stashing
> them into a queue, and prefetching based on that. That's kinda what the
> patch does, except that it does it from inside index_getnext_tid(). But
> that does not work for index_getnext_slot(), because that already reads
> the heap tuples.

> We could say prefetching only works for index_getnext_tid(), but that
> seems a bit weird because that's what regular index scans do. (There's a
> patch to evaluate filters on index, which switches index scans to
> index_getnext_tid(), so that'd make prefetching work too, but I'd ignore
> that here.)

I think we should just switch plain index scans to index_getnext_tid(). It's
one of the primary places triggering index scans, so a few additional lines
don't seem problematic.

I continue to think that we should not have split plain and index only scans
into separate files...


> There are other index_getnext_slot() callers, and I don't
> think we should accept that prefetching does not work for those places
> (e.g. execIndexing/execReplication would benefit from prefetching, I think).

I don't think it'd be a problem to have to opt into supporting
prefetching. There's plenty places where it doesn't really seem likely to be
useful, e.g. doing prefetching during syscache lookups is very likely just a
waste of time.

I don't think e.g. execReplication is likely to benefit from prefetching -
you're just fetching a single row after all. You'd need a lot of dead rows to
make it beneficial.  I think it's similar in execIndexing.c.


I suspect we should work on providing executor nodes with some estimates about
the number of rows that are likely to be consumed. If an index scan is under a
LIMIT 1, we shouldn't prefetch. Similar for sequential scan with the
infrastructure in
https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

Greetings,

Andres Freund




Re: index prefetching

2023-12-21 Thread Tomas Vondra



On 12/21/23 07:49, Dilip Kumar wrote:
> On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra
>  wrote:
>>
> I was going through to understand the idea, couple of observations
> 
> --
> + for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
> + {
> + entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
> +
> + /* Is this the oldest prefetch request in this LRU? */
> + if (entry->request < oldestRequest)
> + {
> + oldestRequest = entry->request;
> + oldestIndex = i;
> + }
> +
> + /*
> + * If the entry is unused (identified by request being set to 0),
> + * we're done. Notice the field is uint64, so empty entry is
> + * guaranteed to be the oldest one.
> + */
> + if (entry->request == 0)
> + continue;
> 
> If the 'entry->request == 0' then we should break instead of continue, right?
> 

Yes, I think that's true. The small LRU caches are accessed/filled
linearly, so once we find an empty entry, all following entries are
going to be empty too.

I thought this shouldn't make any difference, because the LRUs are very
small (only 8 entries, and I don't think we should make them larger).
And it's going to go away once the cache gets full. But now that I think
about it, maybe this could matter for small queries that only ever hit a
couple rows. Hmmm, I'll have to check.

Thanks for noticing this!

> ---
> /*
>  * Used to detect sequential patterns (and disable prefetching).
>  */
> #define PREFETCH_QUEUE_HISTORY 8
> #define PREFETCH_SEQ_PATTERN_BLOCKS 4
> 
> If for sequential patterns we search only 4 blocks, then why are we
> maintaining history for 8 blocks?
> 
> ---

Right, I think there's no reason to keep these two separate constants. I
believe this is a remnant from an earlier patch version which tried to
do something smarter, but I ended up abandoning that.

> 
> + *
> + * XXX Perhaps this should be tied to effective_io_concurrency somehow?
> + *
> + * XXX Could it be harmful that we read the queue backwards? Maybe memory
> + * prefetching works better for the forward direction?
> + */
> + for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
> 
> Correct, I think if we fetch this forward it will have an advantage
> with memory prefetching.
> 

OK, although we only really have a couple of uint32 values, so they should
be on the same cacheline, I guess.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2023-12-21 Thread Tomas Vondra
On 12/20/23 20:09, Robert Haas wrote:
> On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
> ...
>> I have imagined something like this:
>>
>> nodeIndexscan / index_getnext_slot()
>> -> no callback, all TIDs are prefetched
>>
>> nodeIndexonlyscan / index_getnext_tid()
>> -> callback checks VM for the TID, prefetches if not all-visible
>> -> the VM check result is stored in the queue with the VM (but in an
>>extensible way, so that other callback can store other stuff)
>> -> index_getnext_tid() also returns this extra information
>>
>> So not that different from the WIP patch, but in a "generic" and
>> extensible way. Instead of hard-coding the all-visible flag, there'd be
>> some custom piece of information. A bit like qsort_r() has a void* arg to
>> pass custom context.
>>
>> Or if you envisioned something different, could you elaborate a bit?
> 
> I can't totally follow the sketch you give above, but I think we're
> thinking along similar lines, at least.
> 

Yeah, it's hard to discuss vague descriptions of code that does not
exist yet. I'll try to do the actual patch, then we can discuss.

>>> index_prefetch_is_sequential() makes me really nervous
>>> because it seems to depend an awful lot on whether the OS is doing
>>> prefetching, and how the OS is doing prefetching, and I think those
>>> might not be consistent across all systems and kernel versions.
>>
>> If the OS does not have read-ahead, or it's not configured properly,
>> then the patch does not perform worse than what we have now. I'm far
>> more concerned about the opposite issue, i.e. causing regressions with
>> OS-level read-ahead. And the check handles that well, I think.
> 
> I'm just not sure how much I believe that it's going to work well
> everywhere. I mean, I have no evidence that it doesn't, it just kind
> of looks like guesswork to me. For instance, the behavior of the
> algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
> PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
> to say that on some system or workload you didn't test the required
> values aren't entirely different, or that the whole algorithm doesn't
> need rethinking? Maybe we can't really answer that question perfectly,
> but the patch doesn't really explain the reasoning behind this choice
> of algorithm.
> 

You're right, a lot of this is guesswork. I don't think we can do much
better, because it depends on stuff that's out of our control - each OS
may do things differently, or perhaps it's just configured differently.

But I don't think this is really a serious issue - all the read-ahead
implementations need to work about the same, because they are meant to
work in a transparent way.

So it's about deciding at which point we think this is a sequential
pattern. Yes, the OS may use a slightly different threshold, but the
exact value does not really matter - in the worst case we prefetch a
couple more/fewer blocks.

The OS read-ahead can't really prefetch anything except sequential
cases, so the whole question is "When does the access pattern get
sequential enough?". I don't think there's a perfect answer, and I don't
think we need a perfect one - we just need to be reasonably close.

Also, while I don't want to lazily dismiss valid cases that might be
affected by this, I think that sequential access for index paths is not
that common (with the exception of clustered indexes).

FWIW bitmap index scans have exactly the same "problem" except that no
one cares about it because that's how it worked from the start, so it's
not considered a regression.

>>> Similarly with index_prefetch(). There's a lot of "magical"
>>> assumptions here. Even index_prefetch_add_cache() has this problem --
>>> the function assumes that it's OK if we sometimes fail to detect a
>>> duplicate prefetch request, which makes sense, but under what
>>> circumstances is it necessary to detect duplicates and in what cases
>>> is it optional? The function comments are silent about that, which
>>> makes it hard to assess whether the algorithm is good enough.
>>
>> I don't quite understand what problem with duplicates you envision here.
>> Strictly speaking, we don't need to detect/prevent duplicates - it's
>> just that if you do posix_fadvise() for a block that's already in
>> memory, it's overhead / wasted time. The whole point is to not do that
>> very often. In this sense it's entirely optional, but desirable.
> 
> Right ... but the patch sets up some data structure that will
> eliminate duplicates in some circumstances and fail to eliminate them
> in others. So it's making a judgement that the things it catches are
> the cases that are important enough that we need to catch them, and
> the things that it doesn't catch are cases that aren't particularly
> important to catch. Here again, PREFETCH_LRU_SIZE and
> PREFETCH_LRU_COUNT seem like they will have a big impact, but why
> these values? The comments suggest that it's because we want to cover
> ~8MB of data, but it's not clear 

Re: index prefetching

2023-12-20 Thread Dilip Kumar
On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra
 wrote:
>
I was going through the patch to understand the idea; a couple of observations

--
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /*
+ * If the entry is unused (identified by request being set to 0),
+ * we're done. Notice the field is uint64, so empty entry is
+ * guaranteed to be the oldest one.
+ */
+ if (entry->request == 0)
+ continue;

If the 'entry->request == 0' then we should break instead of continue, right?

---
/*
 * Used to detect sequential patterns (and disable prefetching).
 */
#define PREFETCH_QUEUE_HISTORY 8
#define PREFETCH_SEQ_PATTERN_BLOCKS 4

If for sequential patterns we search only 4 blocks, then why are we
maintaining history for 8 blocks?

---

+ *
+ * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+ *
+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?
+ */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)

Correct, I think if we fetch this forward it will have an advantage
with memory prefetching.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: index prefetching

2023-12-20 Thread Robert Haas
On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
 wrote:
> Whatever the right abstraction is, it probably needs to do these VM
> checks only once.

Makes sense.

> Yeah, after you pointed out the "leaky" abstraction, I also started to
> think about customizing the behavior using a callback. Not sure what
> exactly you mean by "fully transparent" but as I explained above I think
> we need to allow passing some information between the prefetcher and the
> executor - for example results of the visibility map checks in IOS.

Agreed.

> I have imagined something like this:
>
> nodeIndexscan / index_getnext_slot()
> -> no callback, all TIDs are prefetched
>
> nodeIndexonlyscan / index_getnext_tid()
> -> callback checks VM for the TID, prefetches if not all-visible
> -> the VM check result is stored in the queue with the VM (but in an
>extensible way, so that other callback can store other stuff)
> -> index_getnext_tid() also returns this extra information
>
> So not that different from the WIP patch, but in a "generic" and
> extensible way. Instead of hard-coding the all-visible flag, there'd be
> some custom piece of information. A bit like qsort_r() has a void* arg to
> pass custom context.
>
> Or if you envisioned something different, could you elaborate a bit?

I can't totally follow the sketch you give above, but I think we're
thinking along similar lines, at least.

> I think if the code stays in indexam.c, it's sensible to keep the index_
> prefix, but then also have a more appropriate rest of the name. For
> example it might be index_prefetch_heap_pages() or something like that.

Yeah, that's not a bad idea.

> > index_prefetch_is_sequential() makes me really nervous
> > because it seems to depend an awful lot on whether the OS is doing
> > prefetching, and how the OS is doing prefetching, and I think those
> > might not be consistent across all systems and kernel versions.
>
> If the OS does not have read-ahead, or it's not configured properly,
> then the patch does not perform worse than what we have now. I'm far
> more concerned about the opposite issue, i.e. causing regressions with
> OS-level read-ahead. And the check handles that well, I think.

I'm just not sure how much I believe that it's going to work well
everywhere. I mean, I have no evidence that it doesn't, it just kind
of looks like guesswork to me. For instance, the behavior of the
algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
to say that on some system or workload you didn't test the required
values aren't entirely different, or that the whole algorithm doesn't
need rethinking? Maybe we can't really answer that question perfectly,
but the patch doesn't really explain the reasoning behind this choice
of algorithm.

> > Similarly with index_prefetch(). There's a lot of "magical"
> > assumptions here. Even index_prefetch_add_cache() has this problem --
> > the function assumes that it's OK if we sometimes fail to detect a
> > duplicate prefetch request, which makes sense, but under what
> > circumstances is it necessary to detect duplicates and in what cases
> > is it optional? The function comments are silent about that, which
> > makes it hard to assess whether the algorithm is good enough.
>
> I don't quite understand what problem with duplicates you envision here.
> Strictly speaking, we don't need to detect/prevent duplicates - it's
> just that if you do posix_fadvise() for a block that's already in
> memory, it's overhead / wasted time. The whole point is to not do that
> very often. In this sense it's entirely optional, but desirable.

Right ... but the patch sets up some data structure that will
eliminate duplicates in some circumstances and fail to eliminate them
in others. So it's making a judgement that the things it catches are
the cases that are important enough that we need to catch them, and
the things that it doesn't catch are cases that aren't particularly
important to catch. Here again, PREFETCH_LRU_SIZE and
PREFETCH_LRU_COUNT seem like they will have a big impact, but why
these values? The comments suggest that it's because we want to cover
~8MB of data, but it's not clear why that should be the right amount
of data to cover. My naive thought is that we'd want to avoid
prefetching a block during the time between we had prefetched it and
when we later read it, but then the value that is here magically 8MB
should really be replaced by the operative prefetch distance.

> I really don't want to have multiple knobs. At this point we have three
> GUCs, each tuning prefetching for a fairly large part of the system:
>
>   effective_io_concurrency = regular queries
>   maintenance_io_concurrency = utility commands
>   recovery_prefetch = recovery / PITR
>
> This seems sensible, but I really don't want many more GUCs tuning
> prefetching for different executor nodes or something like that.
>
> If we have issues with how effective_io_concurrency works (and 

Re: index prefetching

2023-12-19 Thread Tomas Vondra



On 12/18/23 22:00, Robert Haas wrote:
> On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
>  wrote:
>> But there's a layering problem that I don't know how to solve - I don't
>> see how we could make indexam.c entirely oblivious to the prefetching,
>> and move it entirely to the executor. Because how else would you know
>> what to prefetch?
> 
> Yeah, that seems impossible.
> 
> Some thoughts:
> 
> * I think perhaps the subject line of this thread is misleading. It
> doesn't seem like there is any index prefetching going on here at all,
> and there couldn't be, unless you extended the index AM API with new
> methods. What you're actually doing is prefetching heap pages that
> will be needed by a scan of the index. I think this confusing naming
> has propagated itself into some parts of the patch, e.g.
> index_prefetch() reads *from the heap* which is not at all clear from
> the comment saying "Prefetch the TID, unless it's sequential or
> recently prefetched." You're not prefetching the TID: you're
> prefetching the heap tuple to which the TID points. That's not an
> academic distinction IMHO -- the TID would be stored in the index, so
> if we were prefetching the TID, we'd have to be reading index pages,
> not heap pages.

Yes, that's a fair complaint. I think the naming is mostly obsolete -
the prefetching initially happened way way lower - in the index AMs. It
was prefetching the heap pages, ofc, but it kinda seemed reasonable to
call it "index prefetching". And even now it's called from indexam.c
where most functions start with "index_".

But I'll think about some better / clearer name.

> 
> * Regarding layering, my first thought was that the changes to
> index_getnext_tid() and index_getnext_slot() are sensible: read ahead
> by some number of TIDs, keep the TIDs you've fetched in an array
> someplace, use that to drive prefetching of blocks on disk, and return
> the previously-read TIDs from the queue without letting the caller
> know that the queue exists. I think that's the obvious design for a
> feature of this type, to the point where I don't really see that
> there's a viable alternative design.

I agree.

> Driving something down into the individual index AMs would make sense
> if you wanted to prefetch *from the indexes*, but it's unnecessary
> otherwise, and best avoided.
> 

Right. In fact, the patch moved exactly in the opposite direction - it
was originally done at the AM level, and moved up. First to indexam.c,
then even more to the executor.

> * But that said, the skip_all_visible flag passed down to
> index_prefetch() looks like a VERY strong sign that the layering here
> is not what it should be. Right now, when some code calls
> index_getnext_tid(), that function does not need to know or care
> whether the caller is going to fetch the heap tuple or not. But with
> this patch, the code does need to care. So knowledge of the executor
> concept of an index-only scan trickles down into indexam.c, which now
> has to be able to make decisions that are consistent with the ones
> that the executor will make. That doesn't seem good at all.
> 

I agree the all_visible flag is a sign the abstraction is not quite
right. I did that mostly to quickly verify whether the duplicate VM
checks are causing the perf regression (and they are).

Whatever the right abstraction is, it probably needs to do these VM
checks only once.

> * I think it might make sense to have two different prefetching
> schemes. Ideally they could share some structure. If a caller is using
> index_getnext_slot(), then it's easy for prefetching to be fully
> transparent. The caller can just ask for TIDs and the prefetching
> distance and TID queue can be fully under the control of something
> that is hidden from the caller. But when using index_getnext_tid(),
> the caller needs to have an opportunity to evaluate each TID and
> decide whether we even want the heap tuple. If yes, then we feed that
> TID to the prefetcher; if no, we don't. That way, we're not
> replicating executor logic in lower-level code. However, that also
> means that the IOS logic needs to be aware that this TID queue exists
> and interact with whatever controls the prefetch distance. Perhaps
> after calling index_getnext_tid() you call
> index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and
> then you call index_prefetcher_get_tid() to drain the queue. Perhaps
> also the prefetcher has a "fill" callback that gets invoked when the
> TID queue isn't as full as the prefetcher wants it to be. Then
> index_getnext_slot() can just install a trivial fill callback that
> says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...),
> true), but IOS can use a more sophisticated callback that check

Re: index prefetching

2023-12-18 Thread Robert Haas
On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
 wrote:
> But there's a layering problem that I don't know how to solve - I don't
> see how we could make indexam.c entirely oblivious to the prefetching,
> and move it entirely to the executor. Because how else would you know
> what to prefetch?

Yeah, that seems impossible.

Some thoughts:

* I think perhaps the subject line of this thread is misleading. It
doesn't seem like there is any index prefetching going on here at all,
and there couldn't be, unless you extended the index AM API with new
methods. What you're actually doing is prefetching heap pages that
will be needed by a scan of the index. I think this confusing naming
has propagated itself into some parts of the patch, e.g.
index_prefetch() reads *from the heap* which is not at all clear from
the comment saying "Prefetch the TID, unless it's sequential or
recently prefetched." You're not prefetching the TID: you're
prefetching the heap tuple to which the TID points. That's not an
academic distinction IMHO -- the TID would be stored in the index, so
if we were prefetching the TID, we'd have to be reading index pages,
not heap pages.

* Regarding layering, my first thought was that the changes to
index_getnext_tid() and index_getnext_slot() are sensible: read ahead
by some number of TIDs, keep the TIDs you've fetched in an array
someplace, use that to drive prefetching of blocks on disk, and return
the previously-read TIDs from the queue without letting the caller
know that the queue exists. I think that's the obvious design for a
feature of this type, to the point where I don't really see that
there's a viable alternative design. Driving something down into the
individual index AMs would make sense if you wanted to prefetch *from
the indexes*, but it's unnecessary otherwise, and best avoided.

* But that said, the skip_all_visible flag passed down to
index_prefetch() looks like a VERY strong sign that the layering here
is not what it should be. Right now, when some code calls
index_getnext_tid(), that function does not need to know or care
whether the caller is going to fetch the heap tuple or not. But with
this patch, the code does need to care. So knowledge of the executor
concept of an index-only scan trickles down into indexam.c, which now
has to be able to make decisions that are consistent with the ones
that the executor will make. That doesn't seem good at all.

* I think it might make sense to have two different prefetching
schemes. Ideally they could share some structure. If a caller is using
index_getnext_slot(), then it's easy for prefetching to be fully
transparent. The caller can just ask for TIDs and the prefetching
distance and TID queue can be fully under the control of something
that is hidden from the caller. But when using index_getnext_tid(),
the caller needs to have an opportunity to evaluate each TID and
decide whether we even want the heap tuple. If yes, then we feed that
TID to the prefetcher; if no, we don't. That way, we're not
replicating executor logic in lower-level code. However, that also
means that the IOS logic needs to be aware that this TID queue exists
and interact with whatever controls the prefetch distance. Perhaps
after calling index_getnext_tid() you call
index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and
then you call index_prefetcher_get_tid() to drain the queue. Perhaps
also the prefetcher has a "fill" callback that gets invoked when the
TID queue isn't as full as the prefetcher wants it to be. Then
index_getnext_slot() can just install a trivial fill callback that
says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...),
true), but IOS can use a more sophisticated callback that checks the
VM to determine what to pass for the third argument.

* I realize that I'm being a little inconsistent in what I just said,
because in the first bullet point I said that this wasn't really index
prefetching, and now I'm proposing function names that still start
with index_prefetch. It's not entirely clear to me what the best thing
to do about the terminology is here -- could it be a heap prefetcher,
or a TID prefetcher, or an index scan prefetcher? I don't really know,
but whatever we can do to make the naming more clear seems like a
really good idea. Maybe there should be a clearer separation between
the queue of TIDs that we're going to return from the index and the
queue of blocks that we want to prefetch to get the corresponding heap
tuples -- making that separation crisper might ease some of the naming
issues.

* Not that I want to be critical because I think this is a great start
on an important project, but it does look like there's an awful lot of
stuff here that still needs to be sorted out before it would be
reasonable to think of committing this, both in terms of design
decisions and just general polish. There's a lot of stuff marked with
XXX and I think that's great because most of those seem to be go

Re: index prefetching

2023-07-14 Thread Tomas Vondra
mates
+ * are reasonably accurate), so why not use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+	IndexPrefetch	prefetch = scan->xs_prefetch;
+	BlockNumber	block;
+
+	/*
+	 * No heap relation means bitmap index scan, which does prefetching at
+	 * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+	 * without the heap)
+	 *
+	 * XXX But in this case we should have prefetchMaxTarget=0, because in
+	 * index_beginscan_bitmap() we disable prefetching. So maybe we should
+	 * just check that.
+	 */
+	if (!prefetch)
+		return;
+
+	/* was it initialized correctly? */
+	// Assert(prefetch->prefetchIndex != -1);
+
+	/*
+	 * If we got here, prefetching is enabled and it's a node that supports
+	 * prefetching (i.e. it can't be a bitmap index scan).
+	 */
+	Assert(scan->heapRelation);
+
+	prefetch->countAll++;
+
+	block = ItemPointerGetBlockNumber(tid);
+
+	/*
+	 * Do not prefetch the same block over and over again,
+	 *
+	 * This happens e.g. for clustered or naturally correlated indexes
+	 * (fkey to a sequence ID). It's not expensive (the block is in page
+	 * cache already, so no I/O), but it's not free either.
+	 */
+	if (!index_prefetch_add_cache(prefetch, block))
+	{
+		prefetch->countPrefetch++;
+
+		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+		pgBufferUsage.blks_prefetches++;
+	}
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..6ae445d62c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
   !INSTR_TIME_IS_ZERO(usage->blk_write_time));
 		bool		has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
 	   !INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+		bool		has_prefetches = (usage->blks_prefetches > 0);
 		bool		show_planning = (planning && (has_shared ||
   has_local || has_temp || has_timing ||
   has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			appendStringInfoChar(es->str, '\n');
 		}
 
+		/* As above, show only positive counter values. */
+		if (has_prefetches)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b89..e5ce1dbc95 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	/*
 	 * May have to restart scan from this point if a potential conflict is
 	 * found.
+	 *
+	 * XXX Should this do index prefetch? Probably not worth it for unique
+	 * constraints, I guess? Otherwise we should calculate prefetch_target
+	 * just like in nodeIndexscan etc.
 	 */
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e776524227..c0bb732658 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ 

Re: index prefetching

2023-06-30 Thread Tomas Vondra
tring(es->str, "Prefetches:");
+
+			if (usage->blks_prefetches > 0)
+appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+			if (usage->blks_prefetch_rounds > 0)
+appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+			appendStringInfoChar(es->str, '\n');
+		}
+
 		if (show_planning)
 			es->indent--;
 	}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 	/*
 	 * May have to restart scan from this point if a potential conflict is
 	 * found.
+	 *
+	 * XXX Should this do index prefetch? Probably not worth it for unique
+	 * constraints, I guess? Otherwise we should calculate prefetch_target
+	 * just like in nodeIndexscan etc.
 	 */
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
 	index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
 
 	while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
 	/* Build scan key. */
 	skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
 
-	/* Start an index scan. */
-	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+	/* Start an index scan.
+	 *
+	 * XXX Should this do index prefetching? We're looking for a single tuple,
+	 * probably using a PK / UNIQUE index, so does not seem worth it. If we
+	 * reconsider this, calculate prefetch_target like in nodeIndexscan.
+	 */
+	scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
 
 retry:
 	found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	dst->local_blks_written += add->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+	dst->blks_prefetches += add->blks_prefetches;
 	INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
 	INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
 	INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+	dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+	dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
 	INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
 		  add->blk_read_time, sub->blk_read_time);
 	INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * We reach here if the index only scan is not parallel, or if we're
 		 * serially executing an index only scan that was planned to be
 		 * parallel.
+		 *
+		 * XXX Maybe we should enable prefetching, but prefetch only pages that
+		 * are not all-visible (but checking that from the index code seems like
+		 * a violation of layering etc).
+		 *
+		 * XXX This might lead to IOS being slower than plain index scan, if the
+		 * table has a lot of pages that need recheck.
 		 */
 		scandesc = index_beginscan(node->ss.ss_currentRelation,
    node->ioss_RelationDesc,
    estate->es_snapshot,
    node->ioss_NumScanKeys,
-   node->ioss_NumOrderByKeys);
+   node->ioss_NumOrderByKeys,
+   0, 0);	/* no index prefetch for IOS */
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
  node->ioss_RelationDesc,
  node->ioss_NumScanKeys,
  node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0);	/* no index prefet

Re: index prefetching

2023-06-19 Thread Tomas Vondra
 index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed). Secondly, we must prevent an infinite loop when determining
+ * the prefetch value for the tablespace - get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in such a loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
  */
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 Relation indexRelation,
 Snapshot snapshot,
-int nkeys, int norderbys)
+int nkeys, int norderbys,
+int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+	prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
 
 	Assert(snapshot != InvalidSnapshot);
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+	0, 0); /* no prefetch */
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
 		 int nkeys, int norderbys, Snapshot snapshot,
-		 ParallelIndexScanDesc pscan, bool temp_snap)
+		 ParallelIndexScanDesc pscan, bool temp_snap,
+		 int prefetch_target, int prefetch_reset)
 {
 	IndexScanDesc scan;
 
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
-norderbys);
+	scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+prefetch_target, prefetch_reset);
 	/* Initialize information for parallel scan. */
 	scan->parallel_scan = pscan;
 	scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
 
 	scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
 			orderbys, norderbys);
+
+	/* If we're prefetching for this index, maybe reset some of the state. */
+	if (scan->xs_prefetch != NULL)
+	{
+		IndexPrefetch prefetcher = scan->xs_prefetch;
+
+		prefetcher->prefetchIndex = -1;
+		prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+		 prefetcher->prefetchReset);
+	}
 }
 
 /* 
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
  * index_beginscan_parallel - join parallel index scan
  *
  * Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
  */
 IndexScanDesc
 index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
-		 int norderbys, ParallelIndexScanDesc pscan)
+		 int norderbys, ParallelIndexScanDesc pscan,
+		 int prefetch_target, int prefetch_reset)
 {
 	Snapshot	snapshot;
 	IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
 	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
 	RegisterSnapshot(snapshot);
 	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
-	pscan, true);
+	pscan, true, prefetch_target, prefetch_reset);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 
 	pgstat_count_index_tuples(scan->indexRelation, 1);
 
+	/* do index prefetching, if needed */
+	index_prefetch(scan, direction);
+
 	/* Return the TID of the tuple we found. */
	return &scan->xs_heaptid;
 }
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
 
	return build_local_reloptions(&relopts, attoptions, validate);
 }
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items fr

Re: index prefetching

2023-06-12 Thread Dilip Kumar
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
 wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

One of the reasons, IMHO, is that in a bitmap scan the TIDs are already
sorted in heap block order before the heap fetch starts.  So it is
quite obvious that once we prefetch a heap block, most of the
subsequent TIDs will fall on that block, i.e. each prefetch will
satisfy many immediate requests.  OTOH, in an index scan the I/O
requests are very random, so we might have to prefetch many blocks
just to satisfy the TIDs falling on one index page.  I agree that
prefetching with an index scan will definitely help in reducing the
random I/O, but my guess is that prefetching with a bitmap scan
appeared more natural, and that would have been one of the reasons for
implementing it only for bitmap scans.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: index prefetching

2023-06-12 Thread Tomasz Rybak
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
> Hi,
> 
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
> 

I added an entry to 
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.

-- 
Tomasz Rybak, Debian Developer 
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C




Re: index prefetching

2023-06-10 Thread Tomas Vondra



On 6/10/23 22:34, Andres Freund wrote:
> Hi,
> 
> On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
>>>
 2) prefetching from executor

 Another question is whether the prefetching shouldn't actually happen
 even higher - in the executor. That's what Andres suggested during the
 unconference, and it kinda makes sense. That's where we do prefetching
 for bitmap heap scans, so why should this happen lower, right?
>>>
>>> Yea. I think it also provides potential for further optimizations in the
>>> future to do it at that layer.
>>>
>>> One thing I have been wondering around this is whether we should not have
>>> split the code for IOS and plain indexscans...
>>>
>>
>> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
>> did you mean something else?
> 
> Yes, I meant that.
> 

Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.

 4) per-leaf prefetching

 The code is restricted to only prefetch items from one leaf page. If the
 index scan needs to scan multiple (many) leaf pages, we have to process
 the first leaf page first before reading / prefetching the next one.

 I think this is an acceptable limitation, certainly for v0. Prefetching
 across multiple leaf pages seems way more complex (particularly for the
 cases using pairing heap), so let's leave this for the future.
>>>
>>> Hm. I think that really depends on the shape of the API we end up with. If 
>>> we
>>> move the responsibility more towards the executor, I think it very well
>>> could end up being just as simple to prefetch across index pages.
>>>
>>
>> Maybe. I'm open to that idea if you have an idea how to shape the API to
>> make this possible (although perhaps not in v0).
> 
> I'll try to have a look.
> 
> 
>>> I'm a bit confused by some of these numbers. How can OS-level prefetching 
>>> lead
>>> to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
>>> Unless I missed what "xeon / cached (speedup)" indicates?
>>>
>>
>> I forgot to explain what "cached" means in the TPC-H case. It means
>> second execution of the query, so you can imagine it like this:
>>
>> for q in `seq 1 22`; do
>>
>>1. drop caches and restart postgres
> 
> Are you doing it in that order? If so, the pagecache can end up being seeded
> by postgres writing out dirty buffers.
> 

Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.

> 
>>2. run query $q -> uncached
>>
>>3. run query $q -> cached
>>
>> done
>>
>> So the second execution has a chance of having data in memory - but
>> maybe not all, because this is a 100GB data set (so ~200GB after
>> loading), but the machine only has 64GB of RAM.
>>
>> I think a likely explanation is some of the data wasn't actually in
>> memory, so prefetching still did something.
> 
> Ah, ok.
> 
> 
>>> I think it'd be good to run a performance comparison of the unpatched vs
>>> patched cases, with prefetching disabled for both. It's possible that
>>> something in the patch caused unintended changes (say spilling during a
>>> hashagg, due to larger struct sizes).
>>>
>>
>> That's certainly a good idea. I'll do that in the next round of tests. I
>> also plan to do a test on data set that fits into RAM, to test "properly
>> cached" case.
> 
> Cool. It'd be good to measure both the case of all data already being in s_b
> (to see the overhead of the buffer mapping lookups) and the case where the
> data is in the kernel pagecache (to see the overhead of pointless
> posix_fadvise calls).
> 

OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2023-06-10 Thread Andres Freund
Hi,

On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
> > 
> >> 2) prefetching from executor
> >>
> >> Another question is whether the prefetching shouldn't actually happen
> >> even higher - in the executor. That's what Andres suggested during the
> >> unconference, and it kinda makes sense. That's where we do prefetching
> >> for bitmap heap scans, so why should this happen lower, right?
> > 
> > Yea. I think it also provides potential for further optimizations in the
> > future to do it at that layer.
> > 
> > One thing I have been wondering around this is whether we should not have
> > split the code for IOS and plain indexscans...
> > 
> 
> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
> did you mean something else?

Yes, I meant that.

> >> 4) per-leaf prefetching
> >>
> >> The code is restricted to only prefetch items from one leaf page. If the
> >> index scan needs to scan multiple (many) leaf pages, we have to process
> >> the first leaf page first before reading / prefetching the next one.
> >>
> >> I think this is an acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
> > 
> > Hm. I think that really depends on the shape of the API we end up with. If 
> > we
> > move the responsibility more towards the executor, I think it very well
> > could end up being just as simple to prefetch across index pages.
> > 
> 
> Maybe. I'm open to that idea if you have an idea how to shape the API to
> make this possible (although perhaps not in v0).

I'll try to have a look.


> > I'm a bit confused by some of these numbers. How can OS-level prefetching 
> > lead
> > to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
> > Unless I missed what "xeon / cached (speedup)" indicates?
> > 
> 
> I forgot to explain what "cached" means in the TPC-H case. It means
> second execution of the query, so you can imagine it like this:
> 
> for q in `seq 1 22`; do
> 
>1. drop caches and restart postgres

Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.


>2. run query $q -> uncached
> 
>3. run query $q -> cached
> 
> done
> 
> So the second execution has a chance of having data in memory - but
> maybe not all, because this is a 100GB data set (so ~200GB after
> loading), but the machine only has 64GB of RAM.
> 
> I think a likely explanation is some of the data wasn't actually in
> memory, so prefetching still did something.

Ah, ok.


> > I think it'd be good to run a performance comparison of the unpatched vs
> > patched cases, with prefetching disabled for both. It's possible that
> > something in the patch caused unintended changes (say spilling during a
> > hashagg, due to larger struct sizes).
> > 
> 
> That's certainly a good idea. I'll do that in the next round of tests. I
> also plan to do a test on data set that fits into RAM, to test "properly
> cached" case.

Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).

Greetings,

Andres Freund




Re: index prefetching

2023-06-09 Thread Gregory Smith
On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra 
wrote:

> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans


At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fit into PG's I/O model, and both of us did much
review work just to find working common ground that PG could plug into.
Linux POSIX advisories were completely different from Solaris's async
model, the other OS used for validation that the feature worked, with the
hope being that designing against two APIs would be better than just
focusing on Linux.  Since that foundation was all so brittle and limited,
scope was limited to just the heap scan, since it seemed to have the best
return on time invested given the parts of async I/O that did and didn't
scale as expected.

As I remember it, the idea was to get the basic feature out the door and
gather feedback about things like whether the effective_io_concurrency knob
worked as expected before moving onto other prefetching.  Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work.  I think it's just that no one ever got back to it.

I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests.  To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages.  Here's that one as a pgbench script:

\set range 67 * (:multiplier + 1)
\set limit 10 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;

And then you use '-Dmultiplier=10' or such to crank it up.  Database 4X
RAM, multiplier=25 with 16 clients is my starting point on it when I want
to saturate storage.  Anything that lets me bring those numbers down would
be valuable.

--
Greg Smith  greg.sm...@crunchydata.com
Director of Open Source Strategy


Re: index prefetching

2023-06-09 Thread Peter Geoghegan
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra
 wrote:
> > What the exact historical timeline is may not be that important. My
> > emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> > compelling case for both parallel index scan and prefetching, in
> > general. There are many queries that have huge in() lists that
> > naturally benefit a great deal from prefetching. Plus they're common.
> >
>
> Did you mean parallel index scan or bitmap index scan?

I meant parallel index scan (also parallel bitmap index scan). Note
that nbtree parallel index scans have special ScalarArrayOpExpr
handling code.

ScalarArrayOpExpr is kind of special -- it is simultaneously one big
index scan (to the executor), and lots of small index scans (to
nbtree). Unlike the queries that you've looked at so far, which really
only have one plausible behavior at execution time, there are many
ways that ScalarArrayOpExpr index scans can be executed at runtime --
some much faster than others. The nbtree implementation can in
principle reorder how it processes ranges from the key space (i.e.
each range of array elements) with significant flexibility.

> I think I understand, although maybe my mental model is wrong. I agree
> it seems inefficient, but I'm not sure why would it make prefetching
> hopeless. Sure, it puts index scans at a disadvantage (compared to
> bitmap scans), but it we pick index scan it should still be an
> improvement, right?

Hopeless might have been too strong of a word. More like it'd fall far
short of what is possible to do with a ScalarArrayOpExpr with a given
high end server.

The quality of the implementation (including prefetching) could make a
huge difference to how well we make use of the available hardware
resources. A really high quality implementation of ScalarArrayOpExpr +
prefetching can keep the system busy with useful work, which is less
true with other types of queries, which have inherently less
predictable I/O (and often have less I/O overall). What could be more
amenable to predicting I/O patterns than a query with a large IN()
list, with many constants that can be processed in whatever order
makes sense at runtime?

What I'd like to do with ScalarArrayOpExpr is to teach nbtree to
coalesce together those "small index scans" into "medium index scans"
dynamically, where that makes sense. That's the main part that's
missing right now. Dynamic behavior matters a lot with
ScalarArrayOpExpr stuff -- that's where the challenge lies, but also
where the opportunities are. Prefetching builds on all that.

> I guess I need to do some testing on a range of data sets / queries, and
> see how it works in practice.

If I can figure out a way of getting ScalarArrayOpExpr to visit each
leaf page exactly once, that might be enough to make things work
really well most of the time. Maybe it won't even be necessary to
coordinate very much, in the end. Unsure.

I've already done a lot of work that tries to minimize the chances of
regular (non-ScalarArrayOpExpr) queries accessing more than a single
leaf page, which will help your strategy of just prefetching items
from a single leaf page at a time -- that will get you pretty far
already. Consider the example of the tenk2_hundred index from the
bt_page_items documentation. You'll notice that the high key for the
page shown in the docs (and every other page in the same index) nicely
makes the leaf page boundaries "aligned" with natural keyspace
boundaries, due to suffix truncation. That helps index scans to access
no more than a single leaf page when accessing any one distinct
"hundred" value.

We are careful to do the right thing with the "boundary cases" when we
descend the tree, too. This _bt_search behavior builds on the way that
suffix truncation influences the on-disk structure of indexes. Queries
such as "select * from tenk2 where hundred = ?" will each return 100
rows spread across almost as many heap pages. That's a fairly large
number of rows/heap pages, but we still only need to access one leaf
page for every possible constant value (every "hundred" value that
might be specified as the ? in my point query example). It doesn't
matter if it's the leftmost or rightmost item on a leaf page -- we
always descend to exactly the correct leaf page directly, and we
always terminate the scan without having to move to the right sibling
page (we check the high key before going to the right page in some
cases, per the optimization added by commit 29b64d1d).

The same kind of behavior is also seen with the TPC-C line items
primary key index, which is a composite index. We want to access the
items from a whole order in one go, from one leaf page -- and we
reliably do the right thing there too (though with some caveats about
CREATE INDEX). We should never have to access more than one leaf page
to read a single order's line items. This matters because it's quite
natural to want to access whole orders with that particular
table/workload (it's also unnatural to 

Re: index prefetching

2023-06-09 Thread Tomas Vondra



On 6/9/23 01:38, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
>  wrote:
>> Normal index scans are an even more interesting case but I'm not
>> sure how hard it would be to get that information. It may only be
>> convenient to get the blocks from the last leaf page we looked at,
>> for example.
>>
>> So this suggests we simply started prefetching for the case where the
>> information was readily available, and it'd be harder to do for index
>> scans so that's it.
> 
> What the exact historical timeline is may not be that important. My
> emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> compelling case for both parallel index scan and prefetching, in
> general. There are many queries that have huge in() lists that
> naturally benefit a great deal from prefetching. Plus they're common.
> 

Did you mean parallel index scan or bitmap index scan?

But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.

>> Even if SAOP (probably) wasn't the reason, I think you're right it may
>> be an issue for prefetching, causing regressions. It didn't occur to me
>> before, because I'm not that familiar with the btree code and/or how it
>> deals with SAOP (and didn't really intend to study it too deeply).
> 
> I'm pretty sure that you understand this already, but just in case:
> ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
> page" in many important cases. Not really -- not in the sense that
> you'd hope and expect. We're senselessly processing the same index
> leaf page multiple times and treating it as a different, independent
> leaf page. That makes heap prefetching of the kind you're working on
> utterly hopeless, since it effectively throws away lots of useful
> context. Obviously that's the fault of nbtree ScalarArrayOpExpr
> handling, not the fault of your patch.
> 

I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?

I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.

>> So if you're planning to work on this for PG17, collaborating on it
>> would be great.
>>
>> For now I plan to just ignore SAOP, or maybe just disabling prefetching
>> for SAOP index scans if it proves to be prone to regressions. That's not
>> great, but at least it won't make matters worse.
> 
> Makes sense, but I hope that it won't come to that.
> 
> IMV it's actually quite reasonable that you didn't expect to have to
> think about ScalarArrayOpExpr at all -- it would make a lot of sense
> if that was already true. But the fact is that it works in a way
> that's pretty silly and naive right now, which will impact
> prefetching. I wasn't really thinking about regressions, though. I was
> actually more concerned about missing opportunities to get the most
> out of prefetching. ScalarArrayOpExpr really matters here.
> 

OK

>> I guess something like this might be a "nice" bad case:
>>
>> insert into btree_test mod(i,10), md5(i::text)
>>   from generate_series(1, $ROWS) s(i)
>>
>> select * from btree_test where a in (999, 1000, 1001, 1002)
>>
>> The values are likely colocated on the same heap page, the bitmap scan
>> is going to do a single prefetch. With index scan we'll prefetch them
>> repeatedly. I'll give it a try.
> 
> This is the sort of thing that I was thinking of. What are the
> conditions under which bitmap index scan starts to make sense? Why is
> the break-even point whatever it is in each case, roughly? And, is it
> actually because of laws-of-physics level trade-off? Might it not be
> due to implementation-level issues that are much less fundamental? In
> other words, might it actually be that we're just doing something
> stoopid in the case of plain index scans? Something that is just
> papered-over by bitmap index scans right now?
> 

Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in unexpected way (say,
seem like they should improve but don't).

> I see that your patch has logic that avoids repeated prefetching of
> the same block -- plus you have comments that wonder about going
> further by adding a "small lru array" in your new index_prefetch()
> function. I asked you about this during the unconference presentation.
> But I think that my understanding of the situation was slightly
> different to yours. That's relevant here.
> 
> I wonder if you should go further than this, by actually sorting the
> items that you need to fetch as part of processing a given leaf page
> (I said this at the unconference, you may recall). Why should we
> *ever* pin/access the same heap page more than once per leaf 

Re: index prefetching

2023-06-09 Thread Tomas Vondra
On 6/9/23 02:06, Andres Freund wrote:
> Hi,
> 
> On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
>> At pgcon unconference I presented a PoC patch adding prefetching for
>> indexes, along with some benchmark results demonstrating the (pretty
>> significant) benefits etc. The feedback was quite positive, so let me
>> share the current patch more widely.
> 
> I'm really excited about this work.
> 
> 
>> 1) pairing-heap in GiST / SP-GiST
>>
>> For most AMs, the index state is pretty trivial - matching items from a
>> single leaf page. Prefetching that is pretty trivial, even if the
>> current API is a bit cumbersome.
>>
>> Distance queries on GiST and SP-GiST are a problem, though, because
>> those do not just read the pointers into a simple array, as the distance
>> ordering requires passing stuff through a pairing-heap :-(
>>
>> I don't know how to best deal with that, especially not in the simple
>> API. I don't think we can "scan forward" stuff from the pairing heap, so
>> the only idea I have is actually having two pairing-heaps. Or maybe
>> using the pairing heap for prefetching, but stashing the prefetched
>> pointers into an array and then returning stuff from it.
>>
>> In the patch I simply prefetch items before we add them to the pairing
>> heap, which is good enough for demonstrating the benefits.
> 
> I think it'd be perfectly fair to just not tackle distance queries for now.
> 

My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.

> 
>> 2) prefetching from executor
>>
>> Another question is whether the prefetching shouldn't actually happen
>> even higher - in the executor. That's what Andres suggested during the
>> unconference, and it kinda makes sense. That's where we do prefetching
>> for bitmap heap scans, so why should this happen lower, right?
> 
> Yea. I think it also provides potential for further optimizations in the
> future to do it at that layer.
> 
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...
> 

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?

> 
>> 4) per-leaf prefetching
>>
>> The code is restricted to only prefetch items from one leaf page. If the
>> index scan needs to scan multiple (many) leaf pages, we have to process
>> the first leaf page first before reading / prefetching the next one.
>>
>> I think this is an acceptable limitation, certainly for v0. Prefetching
>> across multiple leaf pages seems way more complex (particularly for the
>> cases using pairing heap), so let's leave this for the future.
> 
> Hm. I think that really depends on the shape of the API we end up with. If we
> move the responsibility more towards the executor, I think it very well
> could end up being just as simple to prefetch across index pages.
> 

Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).

> 
>> 5) index-only scans
>>
>> I'm not sure what to do about index-only scans. On the one hand, the
>> point of IOS is not to read stuff from the heap at all, so why prefetch
>> it. OTOH if there are many allvisible=false pages, we still have to
>> access that. And if that happens, this leads to the bizarre situation
>> that IOS is slower than regular index scan. But to address this, we'd
>> have to consider the visibility during prefetching.
> 
> That should be easy to do, right?
> 

It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway, so it seems like it wouldn't add a lot
of overhead either.

> 
> 
>> Benchmark / TPC-H
>> -
>>
>> I ran the 22 queries on 100GB data set, with parallel query either
>> disabled or enabled. And I measured timing (and speedup) for each query.
>> The speedup results look like this (see the attached PDF for details):
>>
>> query    serial    parallel
>> 1         101%        99%
>> 2         119%       100%
>> 3         100%        99%
>> 4         101%       100%
>> 5         101%       100%
>> 6          12%        99%
>> 7         100%       100%
>> 8          52%        67%
>> 10        102%       101%
>> 11        100%        72%
>> 12        101%       100%
>> 13        100%       101%
>> 14         13%       100%
>> 15        101%       100%
>> 16         99%        99%
>> 17         95%       101%
>> 18        101%       106%
>> 19         30%        40%
>> 20         99%       100%
>> 21        101%       100%
>> 22        101%       107%
>>
>> The percentage is (timing patched / master, so <100% means faster, >100%
>> means slower).
>>
>> The different queries are affected depending on the query plan - many
>> queries are close to 100%, which means "no difference". For the serial
>> case, 

Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan  wrote:
> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.

I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:

create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;

This gives me a table with 4 pages, and an index with 2 pages.

The following query selects about half of the rows from the table:

select * from bitmap_parity_test where randkey < 500;

If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.

I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 5 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.

Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.

(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10)  from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)
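For intuition, here is a deliberately crude pin-counting model of the
example above (Python purely as illustration; the exact numbers depend
on how the random randkey values interleave across heap pages, so this
is a ballpark sketch, not a reproduction of the EXPLAIN output):

```python
# Toy pin-count model: ~125 matching index tuples interleaved across
# 4 heap pages, plus 1 index leaf page.  A plain index scan re-pins a
# heap page whenever the block number changes between consecutive
# tuples; a bitmap-style (per-leaf batched) scan pins each distinct
# page exactly once.

def plain_index_scan_hits(heap_blocks):
    hits = 1                      # the single index leaf page
    prev = None
    for block in heap_blocks:
        if block != prev:         # pin dropped, then re-acquired later
            hits += 1
            prev = block
    return hits

def batched_scan_hits(heap_blocks):
    return 1 + len(set(heap_blocks))   # leaf page + one pin per heap page

# round-robin placement: adjacent index entries never share a heap page
matches = [i % 4 for i in range(125)]
print(plain_index_scan_hits(matches))  # 126 -- same ballpark as the 89 above
print(batched_scan_hits(matches))      # 5
```

The gap between the two counts is pure pin/unpin churn; none of it buys
any additional I/O or correctness.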

-- 
Peter Geoghegan




Re: index prefetching

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.

I'm really excited about this work.


> 1) pairing-heap in GiST / SP-GiST
> 
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
> 
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
> 
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
> 
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.
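The "prefetch, then stash into an array" variant mentioned above can be
sketched in a few lines (Python purely as illustration; `prefetch`
stands in for something like PrefetchBuffer(), and `distance` for the
effective prefetch window):

```python
import heapq

def ordered_scan_with_prefetch(items, distance, prefetch):
    """Yield (dist, tid) pairs in distance order, issuing a prefetch
    for each item up to `distance` entries before it is returned."""
    heap = list(items)
    heapq.heapify(heap)
    stash = []          # popped (and prefetched) but not yet returned
    while heap or stash:
        # keep `distance` prefetched items stashed ahead of consumption
        while heap and len(stash) < distance:
            entry = heapq.heappop(heap)
            prefetch(entry[1])          # issue the fadvise-style hint
            stash.append(entry)
        yield stash.pop(0)

issued = []
out = list(ordered_scan_with_prefetch([(3, 'c'), (1, 'a'), (2, 'b')],
                                      2, issued.append))
print(out)   # [(1, 'a'), (2, 'b'), (3, 'c')] -- still in distance order
```

The ordering guarantee is preserved because items are returned from the
stash in exactly the order the pairing heap produced them; the stash
only decouples "when we hint" from "when we return".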

I think it'd be perfectly fair to just not tackle distance queries for now.


> 2) prefetching from executor
> 
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...


> 4) per-leaf prefetching
> 
> The code is currently restricted to prefetching items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
> 
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.


> 5) index-only scans
> 
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access them. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?



> Benchmark / TPC-H
> -----------------
> 
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
> 
> query    serial    parallel
> 1  101% 99%
> 2  119%100%
> 3  100% 99%
> 4  101%100%
> 5  101%100%
> 6   12% 99%
> 7  100%100%
> 8   52% 67%
> 10 102%101%
> 11 100% 72%
> 12 101%100%
> 13 100%101%
> 14  13%100%
> 15 101%100%
> 16  99% 99%
> 17  95%101%
> 18 101%106%
> 19  30% 40%
> 20  99%100%
> 21 101%100%
> 22 101%107%
> 
> The percentage is (timing patched / master), so <100% means faster,
> >100% means slower.
> 
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
> 
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
> 
> There are a couple regressions too, I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.
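One such heuristic could mirror the bitmap heap scan's gradual ramp-up
of its prefetch target, so that short scans never pay for a full
prefetch window (a sketch only; the doubling schedule is roughly, not
exactly, what nodeBitmapHeapscan.c does):

```python
def next_prefetch_target(current, maximum):
    """Ramp the prefetch distance up gradually: start at 1 and double
    on each step until the configured maximum (e.g. the value derived
    from effective_io_concurrency) is reached."""
    if current >= maximum:
        return maximum
    return min(maximum, max(1, current * 2))

targets = []
t = 0
for _ in range(6):
    t = next_prefetch_target(t, 16)
    targets.append(t)
print(targets)  # [1, 2, 4, 8, 16, 16]
```

A scan that ends after a handful of tuples only ever issues one or two
hints, which is exactly the behavior that avoids the regressions above.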

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to such massive gains in the already-cached case, e.g. in TPC-H q06 and q08?

Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
 wrote:
> Normal index scans are an even more interesting case but I'm not
> sure how hard it would be to get that information. It may only be
> convenient to get the blocks from the last leaf page we looked at,
> for example.
>
> So this suggests we simply started prefetching for the case where the
> information was readily available, and it'd be harder to do for index
> scans so that's it.

What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

> Even if SAOP (probably) wasn't the reason, I think you're right it may
> be an issue for prefetching, causing regressions. It didn't occur to me
> before, because I'm not that familiar with the btree code and/or how it
> deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.

> So if you're planning to work on this for PG17, collaborating on it
> would be great.
>
> For now I plan to just ignore SAOP, or maybe just disabling prefetching
> for SAOP index scans if it proves to be prone to regressions. That's not
> great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that.

IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.

> I guess something like this might be a "nice" bad case:
>
> insert into btree_test mod(i,10), md5(i::text)
>   from generate_series(1, $ROWS) s(i)
>
> select * from btree_test where a in (999, 1000, 1001, 1002)
>
> The values are likely colocated on the same heap page, the bitmap scan
> is going to do a single prefetch. With index scan we'll prefetch them
> repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?

I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
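A "small lru array" of recently prefetched blocks might look like the
following (a sketch only; the capacity, and using an ordered map rather
than a fixed array of LRU slots, are placeholders):

```python
from collections import OrderedDict

class RecentPrefetches:
    """Tiny LRU of recently prefetched block numbers: skip re-issuing
    a posix_fadvise-style hint for a block we touched moments ago."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def should_prefetch(self, blockno):
        if blockno in self.blocks:
            self.blocks.move_to_end(blockno)   # refresh recency
            return False                       # hinted recently; skip
        self.blocks[blockno] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the oldest entry
        return True
```

Note what this can and cannot do: it suppresses back-to-back duplicate
hints, but it cannot recover the batching opportunities lost when the
same leaf page is processed several times over.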

I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
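The per-leaf batching idea can be sketched in a few lines (Python
purely as illustration; `read_page(blockno) -> {offset: tuple}` is a
hypothetical helper standing in for pinning a heap buffer):

```python
def fetch_leaf_tids(tids, read_page):
    """Mini "bitmap index scan" for one leaf page: visit each distinct
    heap page once, in block order, yet return tuples in the original
    index-wise order."""
    by_block = {}
    for blockno, offset in sorted(set(tids)):    # block-ordered access
        by_block.setdefault(blockno, set()).add(offset)
    fetched = {}
    for blockno, offsets in by_block.items():
        page = read_page(blockno)                # one pin per heap page
        for off in offsets:
            fetched[(blockno, off)] = page[off]
    return [fetched[tid] for tid in tids]        # restore original order
```

Buffer hit counts for the parity example above then match the bitmap
case: each heap page is pinned once per leaf page processed, not once
per returned tuple.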

This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching 

Re: index prefetching

2023-06-08 Thread Tomas Vondra
On 6/8/23 20:56, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
>  wrote:
>> We already do prefetching for bitmap index scans, where the bitmap heap
>> scan prefetches future pages based on effective_io_concurrency. I'm not
>> sure why exactly prefetching was implemented only for bitmap scans, but
>> I suspect the reasoning was that it only helps when there's many
>> matching tuples, and that's what bitmap index scans are for. So it was
>> not worth the implementation effort.
> 
> I have an educated guess as to why prefetching was limited to bitmap
> index scans this whole time: it might have been due to issues with
> ScalarArrayOpExpr quals.
> 
> Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
> "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
> were supported by both index scans and index-only scans -- not just
> bitmap scans, which could handle ScalarArrayOpExpr quals even without
> nbtree directly understanding them. The commit was in late 2011,
> shortly after the introduction of index-only scans -- which seems to
> have been the real motivation. And so it seems to me that support for
> ScalarArrayOpExpr was built with bitmap scans and index-only scans in
> mind. Plain index scan ScalarArrayOpExpr quals do work, but support
> for them seems kinda perfunctory to me (maybe you can think of a
> specific counter-example where plain index scans really benefit from
> ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
> original motivation).
>
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:

Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.

Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.

So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.

There's a couple more ~2008 threads mentioning prefetching, bitmap scans
and even regular index scans (like [2]). None of them even mentions SAOP
stuff at all.

[1]
https://www.postgresql.org/message-id/871wa17vxb.fsf%40oxford.xeocode.com

[2]
https://www.postgresql.org/message-id/87wsnnz046.fsf%40oxford.xeocode.com

> ScalarArrayOpExpr quals for plain index scans don't really make that much
> sense right now because there is no heap prefetching in the index scan
> case, which is almost certainly going to be the major bottleneck
> there. At the same time, adding useful prefetching for
> ScalarArrayOpExpr execution more or less requires that you first
> improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
> in mind that ScalarArrayOpExpr execution (whether for bitmap index
> scans or index scans) is related to skip scan/MDAM techniques -- so
> there are tricky dependencies that need to be considered together.
> 
> Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
> descend the B-Tree for each array constant -- even though in principle
> we could avoid all that work in cases that happen to have locality. In
> other words we'll often descend the tree multiple times and land on
> exactly the same leaf page again and again, without ever noticing that
> we could have gotten away with only descending the tree once (it'd
> also be possible to start the next "descent" one level up, not at the
> root, intelligently reusing some of the work from an initial descent
> -- but you don't need anything so fancy to greatly improve matters
> here).
> 
> This lack of smarts around how many times we call _bt_first() to
> descend the index is merely a silly annoyance when it happens in
> btgetbitmap(). We do at least sort and deduplicate the array up-front
> (inside _bt_sort_array_elements()), so there will be significant
> locality of access each time we needlessly descend the tree.
> Importantly, there is no prefetching "pipeline" to mess up in the
> bitmap index scan case -- since that all happens later on. Not so for
> the superficially similar (though actually rather different) plain
> index scan case -- at least not once you add prefetching. If you're
> uselessly processing the same leaf page multiple times, then there is
> no way that heap prefetching can notice that it should be batching
> things up. The context that would allow prefetching to work well isn't
> really available right now. So the plain index scan case is kinda at a
> gratuitous disadvantage (with prefetching) relative to the bitmap
> 

Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
 wrote:
> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly prefetching was implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.

Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).

ScalarArrayOpExpr quals for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.

Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
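The potential saving is easy to model: with the array sorted, a fresh
descent is only needed when the next constant falls past the current
leaf's high key (Python purely as illustration; `leaf_boundaries` is a
hypothetical flattened view of the leaf level's high keys, not a real
nbtree structure):

```python
import bisect

def descents_needed(constants, leaf_boundaries):
    """Count _bt_first()-style descents for a sorted SAOP array.
    Naive code descends once per array constant; locality-aware code
    descends only when the next constant lands on a different leaf."""
    constants = sorted(set(constants))        # as _bt_sort_array_elements
    descents = 0
    current_leaf = None
    for c in constants:
        leaf = bisect.bisect_left(leaf_boundaries, c)
        if leaf != current_leaf:
            descents += 1                     # really descend the tree
            current_leaf = leaf
    return descents

# four constants, all on the same leaf page (high keys 1500, 3000, 4500)
print(descents_needed([999, 1000, 1001, 1002], [1500, 3000, 4500]))  # 1
```

The naive scheme performs four descents here; noticing the locality cuts
that to one, and, just as importantly, hands the heap-prefetching layer
all four TIDs as a single batch.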

This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.

Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.

> But there's three shortcomings in logic:
>
> 1) It's not clear the thresholds for prefetching being beneficial and
> switching to bitmap index scans are the same value. And as I'll
> demonstrate later, the prefetching threshold is indeed much lower
> (perhaps a couple dozen matching tuples) on large tables.

As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for