Testing Index Skip scan
I have been interested in a query that returns a batch of results filtered by a subset of the first column of an index and ordered by the second. I created a simple (hopefully) reproducible example of the issue, the two queries describe the same data but have very different costs (explain output included in the attached file). server_version 12.8 On slack #pgsql-hackers channel, @sfrost, it was suggested that what I described is achieved by index skip scan. How can I get a development build to test this feature? create table ordered_skip_index_scan_test (id serial not null constraint ordered_skip_index_scan_test_pkey primary key, col1 INT4, col2 INT4); insert into ordered_skip_index_scan_test (col1, col2) select trunc(sqrt(random()*1e6)) as col1, (random()*1e6)::int as col2 from generate_series(1, 11e6); create index ordered_skip_index_col2 on ordered_skip_index_scan_test using btree (col2) create index ordered_skip_index_col1_col2 on ordered_skip_index_scan_test using btree (col1, col2) analyse ordered_skip_index_scan_test; select count(*) from ordered_skip_index_scan_test; explain (analyse, buffers) select * from ordered_skip_index_scan_test where col2 > 5 and col1 in (1, 999) order by col2 limit 10; -- Limit (cost=0.43..263.30 rows=10 width=12) (actual time=0.083..5.668 rows=10 loops=1) -- Buffers: shared hit=5362 -- -> Index Scan using ordered_skip_index_col2 on ordered_skip_index_scan_test (cost=0.43..492524.49 rows=18737 width=12) (actual time=0.082..5.664 rows=10 loops=1) -- Index Cond: (col2 > 5) -- Filter: (col1 = ANY ('{1,999}'::integer[])) -- Rows Removed by Filter: 5343 -- Buffers: shared hit=5362 -- Planning Time: 0.135 ms -- Execution Time: 5.693 ms explain (analyse, buffers) with t as ((select * from ordered_skip_index_scan_test where col2 between 5 and 50 and col1 = 13 limit 10) union (select * from ordered_skip_index_scan_test -- Not taking advantage of the partial -- results to adjust the bounds of the -- subsequent queries where col2 between 5 and 50 and col1 = 8 limit 10) union (select * from ordered_skip_index_scan_test where col2 between 5 and 50 and col1 = 5 limit 10) union (select * from ordered_skip_index_scan_test where col2 between 5 and 50 and col1 = 845 limit 10)) select * from t order by col2 limit 10; -- Limit (cost=159.95..159.97 rows=10 width=12) (actual time=0.225..0.230 rows=10 loops=1) -- Buffers: shared hit=52 -- -> Sort (cost=159.95..160.05 rows=40 width=12) (actual time=0.224..0.227 rows=10 loops=1) -- Sort Key: ordered_skip_index_scan_test.col2 -- Sort Method: top-N heapsort Memory: 25kB -- Buffers: shared hit=52 -- -> HashAggregate (cost=158.28..158.68 rows=40 width=12) (actual time=0.167..0.178 rows=40 loops=1) -- Group Key: ordered_skip_index_scan_test.id, ordered_skip_index_scan_test.col1, ordered_skip_index_scan_test.col2 -- Batches: 1 Memory Usage: 24kB -- Buffers: shared hit=52 -- -> Append (cost=0.43..157.98 rows=40 width=12) (actual time=0.026..0.144 rows=40 loops=1) -- Buffers: shared hit=52 -- -> Limit (cost=0.43..39.35 rows=10 width=12) (actual time=0.026..0.044 rows=10 loops=1) -- Buffers: shared hit=13 -- -> Index Scan using ordered_skip_index_col1_col2 on ordered_skip_index_scan_test (cost=0.43..17214.39 rows=4424 width=12) (actual time=0.025..0.042 rows=10 loops=1) -- Index Cond: ((col1 = 13) AND (col2 >= 5) AND (col2 <= 50)) -- Buffers: shared hit=13 -- -> Limit (cost=0.43..39.35 rows=10 width=12) (actual time=0.011..0.028 rows=10 loops=1) -- Buffers: shared hit=13 -- -> Index Scan using ordered_skip_index_col1_col2 on ordered_skip_index_scan_test ordered_skip_index_scan_test_1 (cost=0.43..17214.39 rows=4424 width=12) (actual time=0.011..0.026 rows=10 loops=1) -- Index Cond: ((col1 = 8) AND (col2 >= 5) AND (col2 <= 50)) -- Buffers: shared hit=13 -- -> Limit (cost=0.43..39.35 rows=10 width=12) (actual time=0.010..0.025 rows=10 loops=1) -- Buffers: shared hit=13 -- -> Index Scan using ordered_skip_index_col1_col2 on ordered_skip_index_scan_test ordered_skip_index_scan_test_2 (cost=0.43..17214.39 rows=4424 width=12) (actual time=0.010..0.023 rows=10 loops=1) -- Index Cond: ((col1 = 5) AND (col2 >= 5) AND (col2 <= 50))
Re: MDAM techniques and Index Skip Scan patch
On Mon, Mar 28, 2022 at 7:07 PM Tom Lane wrote: > Right, that's the case I had in mind --- apologies if my terminology > was faulty. btree can actually handle such a case now, but what it > fails to do is re-descend from the tree root instead of plowing > forward in the index to find the next matching entry. KNNGIST seems vaguely related to what we'd build for nbtree skip scan, though. GiST index scans are "inherently loose", though. KNNGIST uses a pairing heap/priority queue, which seems like the kind of thing nbtree skip scan can avoid. > +1. We at least need to be sure we all are using these terms > the same way. Yeah, there are *endless* opportunities for confusion here. -- Peter Geoghegan
Re: MDAM techniques and Index Skip Scan patch
Peter Geoghegan writes: > The terminology in this area is a mess. MySQL calls > SELECT-DISTINCT-that-matches-an-index "loose index scans". I think > that you're talking about skip scan when you say "loose index scan". > Skip scan is where there is an omitted prefix of columns in the SQL > query -- omitted columns after the first column that lack an equality > qual. Right, that's the case I had in mind --- apologies if my terminology was faulty. btree can actually handle such a case now, but what it fails to do is re-descend from the tree root instead of plowing forward in the index to find the next matching entry. > It might be useful for somebody to go write a "taxonomy of MDAM > techniques", or a glossary. +1. We at least need to be sure we all are using these terms the same way. regards, tom lane
Re: MDAM techniques and Index Skip Scan patch
On Mon, Mar 28, 2022 at 5:21 PM Tom Lane wrote: > In any case, what I was on about is _bt_preprocess_keys() and > adjacent code. I'm surprised that those aren't more expensive > than one palloc in _bt_first. Maybe that logic falls through very > quickly in simple cases, though. I assume that it doesn't really appear in very simple cases (also common cases). But delaying the scan setup work until execution time does seem ugly. That's probably a good enough reason to refactor. -- Peter Geoghegan
Re: MDAM techniques and Index Skip Scan patch
On Tue, Mar 22, 2022 at 1:55 PM Tom Lane wrote: > Peter asked me off-list to spend some time thinking about the overall > direction we ought to be pursuing here. Thanks for taking a look! "5.5 Exploiting Key Prefixes" and "5.6 Ordered Retrieval" from "Modern B-Tree Techniques" are also good, BTW. The terminology in this area is a mess. MySQL calls SELECT-DISTINCT-that-matches-an-index "loose index scans". I think that you're talking about skip scan when you say "loose index scan". Skip scan is where there is an omitted prefix of columns in the SQL query -- omitted columns after the first column that lack an equality qual. Pretty sure that MySQL/InnoDB can't do that -- it can only "skip" to the extent required to make SELECT-DISTINCT-that-matches-an-index perform well, but that's about it. It might be useful for somebody to go write a "taxonomy of MDAM techniques", or a glossary. The existing "Loose indexscan" Postgres wiki page doesn't seem like enough. Something very high level and explicit, with examples, just so we don't end up talking at cross purposes too much. > 1. Usually I'm in favor of doing this sort of thing in an index AM > agnostic way, but here I don't see much point. All of the ideas at > stake rely fundamentally on having a lexicographically-ordered multi > column index; but we don't have any of those except btree, nor do > I think we're likely to get any soon. This motivates the general > tenor of my remarks below, which is "do it in access/nbtree/ not in > the planner". That was my intuition all along, but I didn't quite have the courage to say so -- sounds too much like something that an optimizer dilettante like me would be expected to say. :-) Seems like one of those things where lots of high level details intrinsically need to be figured out on-the-fly, at execution time, rather than during planning. Perhaps it'll be easier to correctly determine that a skip scan plan is the cheapest in practice than to accurately cost skip scan plans. If the only alternative is a sequential scan, then perhaps a very approximate cost model will work well enough. It's probably way too early to tell right now, though. > I've worried in the past that we'd > need planner/statistical support to figure out whether a loose scan > is likely to be useful compared to just plowing ahead in the index. I don't expect to be able to come up with a structure that leaves no unanswered questions about future MDAM work -- it's not realistic to expect everything to just fall into place. But that's okay. Just having everybody agree on roughly the right conceptual model is the really important thing. That now seems quite close, which I count as real progress. > 4. I find each of the above ideas to be far more attractive than > optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really > understand why the current patchsets seem to be driven largely > by that single use-case. I wouldn't even bother with that for the > initial patch. I absolutely agree. I wondered about that myself in the past. My best guess is that a certain segment of users are familiar with SELECT-DISTINCT-that-matches-an-index from MySQL. And so to some extent application frameworks evolved in a world where that capability existed. IIRC Jesper once said that Hibernate relied on this capability. It's probably a lot easier to implement SELECT-DISTINCT-that-matches-an-index if you have the MySQL storage engine model, with concurrency control that's typically based on two-phase locking. I think that MySQL does some amount of deduplication in its executor here -- and *not* in what they call the storage engine. This is exactly what I'd like to avoid in Postgres; as I said "Maintenance of Index Order" (as the paper calls it) seems important, and not something to be added later on. Optimizer and executor layers that each barely know the difference between a skip scan and a full index scan seems like something we might actually want to aim for, rather than avoid. Teaching nbtree to transform quals into ranges sounds odd at first, but it seems like the right approach now, on balance -- that's the only *good* way to maintain index order. (Maintaining index order is needed to avoid needing or relying on deduplication in the executor proper, which is even inappropriate in an implementation of SELECT-DISTINCT-that-matches-an-index IMO.) -- Peter Geoghegan
Re: MDAM techniques and Index Skip Scan patch
Peter Geoghegan writes: > We could get rid of dynamic allocations for BTStackData in > _bt_first(), perhaps. The problem is that there is no simple, > reasonable proof of the maximum height on a B-tree, even though a > B-Tree with more than 7 or 8 levels seems extraordinarily unlikely. Start with a few entries preallocated, and switch to dynamically allocated space if there turn out to be more levels than that, perhaps? Not sure if it's worth the trouble. In any case, what I was on about is _bt_preprocess_keys() and adjacent code. I'm surprised that those aren't more expensive than one palloc in _bt_first. Maybe that logic falls through very quickly in simple cases, though. regards, tom lane
Re: MDAM techniques and Index Skip Scan patch
On Tue, Mar 22, 2022 at 4:06 PM Andres Freund wrote: > Are you thinking of just moving the setup stuff in nbtree (presumably parts of > _bt_first() / _bt_preprocess_keys()) or also stuff in > ExecIndexBuildScanKeys()? > > The latter does show up a bit more heavily in profiles than nbtree specific > setup, and given that it's generic executor type stuff, seems even more > amenable to being moved to plan time. When I was working on the patch series that became the nbtree Postgres 12 work, this came up. At one point I discovered that using palloc0() for the insertion scankey in _bt_first() was a big problem with nested loop joins -- it became a really noticeable bottleneck with one of my test cases. I independently discovered what Tom must have figured out back in 2005, when he committed d961a56899. This led to my making the new insertion scan key structure (BTScanInsertData) not use dynamic allocation. So _bt_first() is definitely performance critical for certain types of queries. We could get rid of dynamic allocations for BTStackData in _bt_first(), perhaps. The problem is that there is no simple, reasonable proof of the maximum height on a B-tree, even though a B-Tree with more than 7 or 8 levels seems extraordinarily unlikely. You could also invent a slow path (maybe do what we do in _bt_insert_parent() in the event of a concurrent root page split/NULL stack), but that runs into the problem of being awkward to test, and pretty ugly. It's doable, but I wouldn't do it unless there was a pretty noticeable payoff. -- Peter Geoghegan
Re: Index Skip Scan (new UniqueKeys)
On 6/9/20 06:22, Dmitry Dolgov wrote: Here is a new version of index skip scan patch, based on v8 patch for UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation, hopefully I didn't forget anyone who actively participated in the discussion. This CommitFest entry has been closed with RwF at [1]. Thanks for all the feedback given ! [1] https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com Best regards, Dmitry & Jesper
Re: MDAM techniques and Index Skip Scan patch
On 3/23/22 18:22, Dmitry Dolgov wrote: The CF item could be set to RwF, what would you say, Jesper? We want to thank the community for the feedback that we have received over the years for this feature. Hopefully a future implementation can use Tom's suggestions to get closer to a committable solution. Here is the last CommitFest entry [1] for the archives. RwF [1] https://commitfest.postgresql.org/37/1741/ Best regards, Dmitry & Jesper
Re: MDAM techniques and Index Skip Scan patch
> On Wed, Mar 23, 2022 at 05:32:46PM -0400, Tom Lane wrote: > Dmitry Dolgov <9erthali...@gmail.com> writes: > > On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: > >> In short: I would throw out just about all the planner infrastructure > >> that's been proposed so far. It looks bulky, expensive, and > >> drastically undercommented, and I don't think it's buying us anything > >> of commensurate value. > > > Broadly speaking planner related changes proposed in the patch so far > > are: UniqueKey, taken from the neighbour thread about select distinct; > > list of uniquekeys to actually pass information about the specified > > loose scan prefix into nbtree; some verification logic to prevent > > applying skipping when it's not supported. I can imagine taking out > > UniqueKeys and passing loose scan prefix in some other form (the other > > parts seems to be essential) -- is that what you mean? > > My point is that for pure loose scans --- that is, just optimizing a scan, > not doing AM-based duplicate-row-elimination --- you do not need to pass > any new data to btree at all. It can infer what to do on the basis of the > set of index quals it's handed. > > The bigger picture here is that I think the reason this patch series has > failed to progress is that it's too scattershot. You need to pick a > minimum committable feature and get that done, and then you can move on > to the next part. I think the minimum committable feature is loose scans, > which will require a fair amount of work in access/nbtree/ but very little > new planner code, and will be highly useful in their own right even if we > never do anything more. > > In general I feel that the UniqueKey code is a solution looking for a > problem, and that treating it as the core of the patchset is a mistake. > We should be driving this work off of what nbtree needs to make progress, > and not building more infrastructure elsewhere than we have to. Maybe > we'll end up with something that looks like UniqueKeys, but I'm far from > convinced of that. I see. I'll need some thinking time about how it may look like (will probably return with more questions). The CF item could be set to RwF, what would you say, Jesper?
Re: MDAM techniques and Index Skip Scan patch
Dmitry Dolgov <9erthali...@gmail.com> writes: > On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: >> In short: I would throw out just about all the planner infrastructure >> that's been proposed so far. It looks bulky, expensive, and >> drastically undercommented, and I don't think it's buying us anything >> of commensurate value. > Broadly speaking planner related changes proposed in the patch so far > are: UniqueKey, taken from the neighbour thread about select distinct; > list of uniquekeys to actually pass information about the specified > loose scan prefix into nbtree; some verification logic to prevent > applying skipping when it's not supported. I can imagine taking out > UniqueKeys and passing loose scan prefix in some other form (the other > parts seems to be essential) -- is that what you mean? My point is that for pure loose scans --- that is, just optimizing a scan, not doing AM-based duplicate-row-elimination --- you do not need to pass any new data to btree at all. It can infer what to do on the basis of the set of index quals it's handed. The bigger picture here is that I think the reason this patch series has failed to progress is that it's too scattershot. You need to pick a minimum committable feature and get that done, and then you can move on to the next part. I think the minimum committable feature is loose scans, which will require a fair amount of work in access/nbtree/ but very little new planner code, and will be highly useful in their own right even if we never do anything more. In general I feel that the UniqueKey code is a solution looking for a problem, and that treating it as the core of the patchset is a mistake. We should be driving this work off of what nbtree needs to make progress, and not building more infrastructure elsewhere than we have to. Maybe we'll end up with something that looks like UniqueKeys, but I'm far from convinced of that. regards, tom lane
Re: MDAM techniques and Index Skip Scan patch
> On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote: > Peter Geoghegan writes: > > Like many difficult patches, the skip scan patch is not so much > > troubled by problems with the implementation as it is troubled by > > *ambiguity* about the design. Particularly concerning how skip scan > > meshes with existing designs, as well as future designs -- > > particularly designs for other MDAM techniques. I've started this > > thread to have a big picture conversation about how to think about > > these things. > > Peter asked me off-list to spend some time thinking about the overall > direction we ought to be pursuing here. I have done that, and here > are a few modest suggestions. Thanks. To make sure I understand your proposal better, I have a couple of questions: > In short: I would throw out just about all the planner infrastructure > that's been proposed so far. It looks bulky, expensive, and > drastically undercommented, and I don't think it's buying us anything > of commensurate value. Broadly speaking planner related changes proposed in the patch so far are: UniqueKey, taken from the neighbour thread about select distinct; list of uniquekeys to actually pass information about the specified loose scan prefix into nbtree; some verification logic to prevent applying skipping when it's not supported. I can imagine taking out UniqueKeys and passing loose scan prefix in some other form (the other parts seems to be essential) -- is that what you mean?
Re: MDAM techniques and Index Skip Scan patch
Andres Freund writes: > On 2022-03-22 16:55:49 -0400, Tom Lane wrote: >> BTW, I've had a bee in my bonnet for a long time about whether some of >> nbtree's scan setup work could be done once during planning, rather than >> over again during each indexscan start. > It does show up in simple-index-lookup heavy workloads. Not as a major thing, > but it's there. And it's just architecturally displeasing :) > Are you thinking of just moving the setup stuff in nbtree (presumably parts of > _bt_first() / _bt_preprocess_keys()) or also stuff in > ExecIndexBuildScanKeys()? Didn't really have specifics in mind. The key stumbling block is that some (not all) of the work depends on knowing the specific values of the indexqual comparison keys, so while you could do that work in advance for constant keys, you'd still have to be prepared to do work at scan start for non-constant keys. I don't have a clear idea about how to factorize that effectively. A couple of other random ideas in this space: * I suspect that a lot of this work overlaps with the efforts that btcostestimate makes along the way to getting a cost estimate. So it's interesting to wonder whether we could refactor so that btcostestimate is integrated with this hypothetical plan-time key preprocessing and doesn't duplicate work. * I think that we run through most or all of that preprocessing logic even for internal catalog accesses, where we know darn well how the keys are set up. We ought to think harder about how we could short-circuit pointless work in those code paths. I don't think any of this is an essential prerequisite to getting something done for loose index scans, which ISTM ought to be the first point of attack for v16. Loose index scans per se shouldn't add much to the key preprocessing costs. But these ideas likely would be useful to look into before anyone starts on the more complicated preprocessing that would be needed for the ideas in the MDAM paper. regards, tom lane
Re: MDAM techniques and Index Skip Scan patch
Hi, On 2022-03-22 16:55:49 -0400, Tom Lane wrote: > 4. I find each of the above ideas to be far more attractive than > optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really > understand why the current patchsets seem to be driven largely > by that single use-case. It's something causing plenty pain in production environments... Obviously it'd be even better if the optimization also triggered in cases like SELECT some_indexed_col FROM blarg GROUP BY some_indexed_col which seems to be what ORMs like to generate. > BTW, I've had a bee in my bonnet for a long time about whether some of > nbtree's scan setup work could be done once during planning, rather than > over again during each indexscan start. It does show up in simple-index-lookup heavy workloads. Not as a major thing, but it's there. And it's just architecturally displeasing :) Are you thinking of just moving the setup stuff in nbtree (presumably parts of _bt_first() / _bt_preprocess_keys()) or also stuff in ExecIndexBuildScanKeys()? The latter does show up a bit more heavily in profiles than nbtree specific setup, and given that it's generic executor type stuff, seems even more amenable to being moved to plan time. Greetings, Andres Freund
Re: MDAM techniques and Index Skip Scan patch
Peter Geoghegan writes: > Like many difficult patches, the skip scan patch is not so much > troubled by problems with the implementation as it is troubled by > *ambiguity* about the design. Particularly concerning how skip scan > meshes with existing designs, as well as future designs -- > particularly designs for other MDAM techniques. I've started this > thread to have a big picture conversation about how to think about > these things. Peter asked me off-list to spend some time thinking about the overall direction we ought to be pursuing here. I have done that, and here are a few modest suggestions. 1. Usually I'm in favor of doing this sort of thing in an index AM agnostic way, but here I don't see much point. All of the ideas at stake rely fundamentally on having a lexicographically-ordered multi column index; but we don't have any of those except btree, nor do I think we're likely to get any soon. This motivates the general tenor of my remarks below, which is "do it in access/nbtree/ not in the planner". 2. The MDAM paper Peter cited is really interesting. You can see fragments of those ideas in our existing btree code, particularly in the scan setup stuff that detects redundant or contradictory keys and determines a scan start strategy. The special handling we implemented awhile ago for ScalarArrayOp index quals is also a subset of what they were talking about. It seems to me that if we wanted to implement more of those ideas, the relevant work should almost all be done in nbtree proper. The planner would need only minor adjustments: btcostestimate would have to be fixed to understand the improvements, and there are some rules in indxpath.c that prevent us from passing "too complicated" sets of indexquals to the AM, which would need to be relaxed or removed altogether. 3. "Loose" indexscan (i.e., sometimes re-descend from the tree root to find the next index entry) is again something that seems like it's mainly nbtree's internal problem. Loose scan is interesting if we have index quals for columns that are after the first column that lacks an equality qual, otherwise not. I've worried in the past that we'd need planner/statistical support to figure out whether a loose scan is likely to be useful compared to just plowing ahead in the index. However, that seems to be rendered moot by the idea used in the current patchsets, ie scan till we find that we'll have to step off the current page, and re-descend at that point. (When and if we find that that heuristic is inadequate, we could work on passing some statistical data forward. But we don't need any in the v1 patch.) Again, we need some work in btcostestimate to understand how the index scan cost will be affected, but I still don't see any pressing need for major planner changes or plan tree contents changes. 4. I find each of the above ideas to be far more attractive than optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really understand why the current patchsets seem to be driven largely by that single use-case. I wouldn't even bother with that for the initial patch. When we do get around to it, it still doesn't need major planner support, I think --- again fixing the cost estimation is the bulk of the work. Munro's original 2014 patch showed that we don't really need all that much to get the planner to build such a plan; the problem is to convince it that that plan will be cheap. In short: I would throw out just about all the planner infrastructure that's been proposed so far. It looks bulky, expensive, and drastically undercommented, and I don't think it's buying us anything of commensurate value. The part of the planner that actually needs serious thought is btcostestimate, which has been woefully neglected in both of the current patchsets. BTW, I've had a bee in my bonnet for a long time about whether some of nbtree's scan setup work could be done once during planning, rather than over again during each indexscan start. This issue might become more pressing if the work becomes significantly more complicated/expensive, which these ideas might cause. But it's a refinement that could be left for later --- and in any case, the responsibility would still fundamentally be nbtree's. I don't think the planner would do more than call some AM routine that could add decoration to an IndexScan plan node. Now ... where did I put my flameproof vest? regards, tom lane
Re: MDAM techniques and Index Skip Scan patch
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund wrote: > IMO it's pretty clear that having "duelling" patches below one CF entry is a > bad idea. I think they should be split, with inactive approaches marked as > returned with feeback or whatnot. I have the impression that this thread is getting some value from having a CF entry, as a multi-person collaboration where people are trading ideas and also making progress that no one wants to mark as returned, but it's vexing for people managing the CF because it's not really proposed for 15. Perhaps what we lack is a new status, "Work In Progress" or something?
Re: MDAM techniques and Index Skip Scan patch
> On Mon, Mar 21, 2022 at 06:34:09PM -0700, Andres Freund wrote: > > > I don't mind having all of the alternatives under the same CF item, only > > one being tested seems to be only a small limitation of cfbot. > > IMO it's pretty clear that having "duelling" patches below one CF entry is a > bad idea. I think they should be split, with inactive approaches marked as > returned with feeback or whatnot. On the other hand even for patches with dependencies (i.e. the patch A depends on the patch B) different CF items cause a lot of confusion for reviewers. I guess for various flavours of the same patch it would be even worse. But I don't have a strong opinion here. > Either way, currently this patch fails on cfbot due to a new GUC: > https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs > https://cirrus-ci.com/task/5134905372835840 This seems to be easy to solve. >From 5bae9fdf8b74e5996b606e78f8b2a5fb327e011b Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Mon, 17 May 2021 11:47:07 +0200 Subject: [PATCH v41 1/6] Unique expressions Extend PlannerInfo and Path structures with the list of relevant unique expressions. It specifies which keys must be unique on the query level, and allows to leverage this into on the path level. At the moment only distinctClause makes use of such mechanism, which enables potential use of index skip scan. Originally proposed by David Rowley, based on the UniqueKey patch implementation from Andy Fan, contains few bits out of previous version from Jesper Pedersen, Floris Van Nee. --- src/backend/nodes/list.c| 31 + src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/pathkeys.c | 62 + src/backend/optimizer/path/uniquekeys.c | 92 + src/backend/optimizer/plan/planner.c| 36 +- src/backend/optimizer/util/pathnode.c | 32 ++--- src/include/nodes/pathnodes.h | 5 ++ src/include/nodes/pg_list.h | 1 + src/include/optimizer/pathnode.h| 1 + src/include/optimizer/paths.h | 9 +++ 10 files changed, 261 insertions(+), 11 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekeys.c diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index f843f861ef..a53a50f372 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2) return 1; return 0; } + +/* + * Return true iff every entry in "members" list is also present + * in the "target" list. + */ +bool +list_is_subset(const List *members, const List *target) +{ + const ListCell *lc1, *lc2; + + Assert(IsPointerList(members)); + Assert(IsPointerList(target)); + check_list_invariants(members); + check_list_invariants(target); + + foreach(lc1, members) + { + bool found = false; + foreach(lc2, target) + { + if (equal(lfirst(lc1), lfirst(lc2))) + { +found = true; +break; + } + } + if (!found) + return false; + } + return true; +} diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 1e199ff66f..7b9820c25f 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -21,6 +21,7 @@ OBJS = \ joinpath.o \ joinrels.o \ pathkeys.o \ - tidpath.o + tidpath.o \ + uniquekeys.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 86a35cdef1..e2be1fbf90 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -29,6 +29,7 @@ #include "utils/lsyscache.h" +static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys); static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); static bool matches_boolean_partition_clause(RestrictInfo *rinfo, RelOptInfo *partrel, @@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root, return pk; } +/* + * pathkey_is_unique + * Checks if the new pathkey's equivalence class is the same as that of + * any existing member of the pathkey list. + */ +static bool +pathkey_is_unique(PathKey *new_pathkey, List *pathkeys) +{ + EquivalenceClass *new_ec = new_pathkey->pk_eclass; + ListCell *lc; + + /* If same EC already is already in the list, then not unique */ + foreach(lc, pathkeys) + { + PathKey*old_pathkey = (PathKey *) lfirst(lc); + + if (new_ec == old_pathkey->pk_eclass) + return false; + } + + return true; +} + /* * pathkey_is_redundant * Is a pathkey redundant with one already in the given list? @@ -1152,6 +1176,44 @@ make_pathkeys_for_sortclauses(PlannerInfo *root, return pathkeys; } +/* + * make_pathkeys_for_uniquekeyclauses + * Generate a pathkeys list to be used for uniquekey clauses + */ +Lis
Re: MDAM techniques and Index Skip Scan patch
Hi, On 2022-01-22 22:37:19 +0100, Dmitry Dolgov wrote: > > On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote: > > Hi, > > > > On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > > > > > FYI, I've attached this thread to the CF item as an informational one, > > > but as there are some patches posted here, folks may get confused. For > > > those who have landed here with no context, I feel obliged to mention > > > that now there are two alternative patch series posted under the same > > > CF item: > > > > > > * the original one lives in [1], waiting for reviews since the last May > > > * an alternative one posted here from Floris > > > > Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. > > It's nice to have the alternative approach threads linkied in the commit > > fest, > > but it seems that the cfbot will use the most recent attachments as the only > > patchset, thus leaving the "original" one untested. > > > > I'm not sure of what's the best approach in such situation. Maybe creating > > a > > different CF entry for each alternative, and link the other cf entry on the > > cf > > app using the "Add annotations" or "Links" feature rather than attaching > > threads? > > I don't mind having all of the alternatives under the same CF item, only > one being tested seems to be only a small limitation of cfbot. IMO it's pretty clear that having "duelling" patches below one CF entry is a bad idea. I think they should be split, with inactive approaches marked as returned with feeback or whatnot. Either way, currently this patch fails on cfbot due to a new GUC: https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs https://cirrus-ci.com/task/5134905372835840 Greetings, Andres Freund
Re: Index Skip Scan (new UniqueKeys)
On Mon, Jan 24, 2022 at 8:51 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > Besides that the new patch version contains some cleaning up and > > > addresses commentaries around leaf page pinning from [1]. The idea > > > behind the series structure is still the same: the first three patches > > > contains the essence of the implementation (hoping to help concentrate > > > review), the rest are more "experimental". > > > > > > [1]: > > > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > > > Hi, > > > > + /* If same EC already is already in the list, then not unique */ > > > > The word already is duplicated. > > > > + * make_pathkeys_for_uniquekeyclauses > > > > The func name in the comment is different from the actual func name. > > Thanks for the review! Right, both above make sense. I'll wait a bit if > there will be more commentaries, and then post a new version with all > changes at once. > > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > > > The year should be 2022 :-) > > Now you see how old is this patch :) > > > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? > > It's actually placed there by analogy with make_pathkeys_for_sortclauses > (immediately preceding function), so I think moving it into uniquekeys > will only make more confusion. > > > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > > +bool allow_multinulls) > > > > It seems allow_multinulls is not checked in the func. Can the parameter > be > > removed ? > > Right, it could be removed. I believe it was somewhat important when the > patch was tightly coupled with the UniqueKeys patch, where it was put in > use. > > Hi, It would be nice to take out this unused parameter for this patch. The parameter should be added in patch series where it is used. Cheers
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote: > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Besides that the new patch version contains some cleaning up and > > addresses commentaries around leaf page pinning from [1]. The idea > > behind the series structure is still the same: the first three patches > > contains the essence of the implementation (hoping to help concentrate > > review), the rest are more "experimental". > > > > [1]: > > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com > > Hi, > > + /* If same EC already is already in the list, then not unique */ > > The word already is duplicated. > > + * make_pathkeys_for_uniquekeyclauses > > The func name in the comment is different from the actual func name. Thanks for the review! Right, both above make sense. I'll wait a bit if there will be more commentaries, and then post a new version with all changes at once. > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > The year should be 2022 :-) Now you see how old is this patch :) > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? It's actually placed there by analogy with make_pathkeys_for_sortclauses (immediately preceding function), so I think moving it into uniquekeys will only make more confusion. > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, > +bool allow_multinulls) > > It seems allow_multinulls is not checked in the func. Can the parameter be > removed ? Right, it could be removed. I believe it was somewhat important when the patch was tightly coupled with the UniqueKeys patch, where it was put in use. > + Path *newpath; > + > + newpath = (Path *) create_projection_path(root, rel, subpath, > + scanjoin_target); > > You can remove variable newpath and assign to lfirst(lc) directly. Yes, but I've followed the same style for create_projection_path as in many other invocations of this function in planner.c -- I would prefer to keep it uniform. > +add_path(RelOptInfo *parent_rel, Path *new_path) > +add_unique_path(RelOptInfo *parent_rel, Path *new_path) > > It seems the above two func's can be combined into one func which > takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter. Sure, but here I've intentionally split it into separate functions, otherwise a lot of not relevant call sites have to be updated to provide the third parameter.
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > > Hi, > > > > Here is another take on the patch with a couple of changes: > > > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > > unique keys patch was actually not that big, so the main difference is > > that now the structure itself went away, a list of unique expressions > > is used instead. All the suggestions about how this feature should > > look like from the planning perspective are still there. On the one > > hand it will allow to develop both patches independently and avoid > > confusion for reviewers, on the other UniqueKeys could be easily > > incorporated back when needed. > > > > * Support for skipping in case of moving backward on demand (scroll > > cursor) is moved into a separate patch. This is implemented via > > returning false from IndexSupportsBackwardScan in case if it's a skip > > scan node, which in turn adds Materialize node on top when needed. The > > name SupportsBackwardScan was a bit confusing for me, but it seems > > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > > Eventually those cases when BackwardScanDirection is used are still > > handled by amskip. This change didn't affect the test coverage, all > > the test cases supported in previous patch versions are still there. > > > > About Materialize node, I guess it could be less performant than a > > "native" support, but it simplifies the implementation significantly > > to the point that most parts, which were causing questions before, are > > now located in the isolated patch. My idea here is to concentrate > > efforts on the first three patches in this series, and consider the > > rest of them as an experiment field. > > > > * IndexScan support was also relocated into a separate patch, the first > > three patches are now only about IndexOnlyScan. > > > > * Last bits of reviews were incorporated and rebased. > > While the patch is still waiting for a review, I was motivated by the > thread [1] to think about it from the interface point of view. Consider > an index skip scan being just like a normal index scan with a set of > underspecified leading search keys. It makes sense to have the same > structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm > not sure if amskip is a good name to represent that, don't have anything > better yet). But the "underspecified" part is currently indeed > interpreted in a limited way -- as "missing" keys -- and is expressed > only via the prefix size. Another option would be e.g. leading keys > constrained by a range of values, so generally speaking it makes sense > to extend amount of the information provided for skipping. > > As a naive approach I've added a new patch into the series, containing > the extra data structure (ScanLooseKeys, doesn't have much meaning yet > except somehow representing keys for skipping) used for index skip scan. > Any thoughts about it? > > Besides that the new patch version contains some cleaning up and > addresses commentaries around leaf page pinning from [1]. The idea > behind the series structure is still the same: the first three patches > contains the essence of the implementation (hoping to help concentrate > review), the rest are more "experimental". > > [1]: > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com Hi, + /* If same EC already is already in the list, then not unique */ The word already is duplicated. + * make_pathkeys_for_uniquekeyclauses The func name in the comment is different from the actual func name. + * Portions Copyright (c) 2020, PostgreSQL Global Development Group The year should be 2022 :-) make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ? +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys, +bool allow_multinulls) It seems allow_multinulls is not checked in the func. Can the parameter be removed ? + Path *newpath; + + newpath = (Path *) create_projection_path(root, rel, subpath, + scanjoin_target); You can remove variable newpath and assign to lfirst(lc) directly. +add_path(RelOptInfo *parent_rel, Path *new_path) +add_unique_path(RelOptInfo *parent_rel, Path *new_path) It seems the above two func's can be combined into one func which takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter. Cheers
Re: MDAM techniques and Index Skip Scan patch
> On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote: > Hi, > > On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > > > FYI, I've attached this thread to the CF item as an informational one, > > but as there are some patches posted here, folks may get confused. For > > those who have landed here with no context, I feel obliged to mention > > that now there are two alternative patch series posted under the same > > CF item: > > > > * the original one lives in [1], waiting for reviews since the last May > > * an alternative one posted here from Floris > > Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. > It's nice to have the alternative approach threads linkied in the commit fest, > but it seems that the cfbot will use the most recent attachments as the only > patchset, thus leaving the "original" one untested. > > I'm not sure of what's the best approach in such situation. Maybe creating a > different CF entry for each alternative, and link the other cf entry on the cf > app using the "Add annotations" or "Links" feature rather than attaching > threads? I don't mind having all of the alternatives under the same CF item, only one being tested seems to be only a small limitation of cfbot.
Re: Index Skip Scan (new UniqueKeys)
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote: > Hi, > > Here is another take on the patch with a couple of changes: > > * I've removed for now UniqueKeys parts. The interaction of skip scan & > unique keys patch was actually not that big, so the main difference is > that now the structure itself went away, a list of unique expressions > is used instead. All the suggestions about how this feature should > look like from the planning perspective are still there. On the one > hand it will allow to develop both patches independently and avoid > confusion for reviewers, on the other UniqueKeys could be easily > incorporated back when needed. > > * Support for skipping in case of moving backward on demand (scroll > cursor) is moved into a separate patch. This is implemented via > returning false from IndexSupportsBackwardScan in case if it's a skip > scan node, which in turn adds Materialize node on top when needed. The > name SupportsBackwardScan was a bit confusing for me, but it seems > it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. > Eventually those cases when BackwardScanDirection is used are still > handled by amskip. This change didn't affect the test coverage, all > the test cases supported in previous patch versions are still there. > > About Materialize node, I guess it could be less performant than a > "native" support, but it simplifies the implementation significantly > to the point that most parts, which were causing questions before, are > now located in the isolated patch. My idea here is to concentrate > efforts on the first three patches in this series, and consider the > rest of them as an experiment field. > > * IndexScan support was also relocated into a separate patch, the first > three patches are now only about IndexOnlyScan. > > * Last bits of reviews were incorporated and rebased. While the patch is still waiting for a review, I was motivated by the thread [1] to think about it from the interface point of view. Consider an index skip scan being just like a normal index scan with a set of underspecified leading search keys. It makes sense to have the same structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm not sure if amskip is a good name to represent that, don't have anything better yet). But the "underspecified" part is currently indeed interpreted in a limited way -- as "missing" keys -- and is expressed only via the prefix size. Another option would be e.g. leading keys constrained by a range of values, so generally speaking it makes sense to extend amount of the information provided for skipping. As a naive approach I've added a new patch into the series, containing the extra data structure (ScanLooseKeys, doesn't have much meaning yet except somehow representing keys for skipping) used for index skip scan. Any thoughts about it? Besides that the new patch version contains some cleaning up and addresses commentaries around leaf page pinning from [1]. The idea behind the series structure is still the same: the first three patches contains the essence of the implementation (hoping to help concentrate review), the rest are more "experimental". [1]: https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com >From 8b61823e523ceecbb2fd6faa1a229038bb981594 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Mon, 17 May 2021 11:47:07 +0200 Subject: [PATCH v40 1/6] Unique expressions Extend PlannerInfo and Path structures with the list of relevant unique expressions. It specifies which keys must be unique on the query level, and allows to leverage this into on the path level. At the moment only distinctClause makes use of such mechanism, which enables potential use of index skip scan. Originally proposed by David Rowley, based on the UniqueKey patch implementation from Andy Fan, contains few bits out of previous version from Jesper Pedersen, Floris Van Nee. --- src/backend/nodes/list.c| 31 + src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/pathkeys.c | 62 + src/backend/optimizer/path/uniquekeys.c | 92 + src/backend/optimizer/plan/planner.c| 36 +- src/backend/optimizer/util/pathnode.c | 32 ++--- src/include/nodes/pathnodes.h | 5 ++ src/include/nodes/pg_list.h | 1 + src/include/optimizer/pathnode.h| 1 + src/include/optimizer/paths.h | 9 +++ 10 files changed, 261 insertions(+), 11 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekeys.c diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index f843f861ef..a53a50f372 100644 ---
Re: MDAM techniques and Index Skip Scan patch
Hi, On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote: > > FYI, I've attached this thread to the CF item as an informational one, > but as there are some patches posted here, folks may get confused. For > those who have landed here with no context, I feel obliged to mention > that now there are two alternative patch series posted under the same > CF item: > > * the original one lives in [1], waiting for reviews since the last May > * an alternative one posted here from Floris Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed. It's nice to have the alternative approach threads linkied in the commit fest, but it seems that the cfbot will use the most recent attachments as the only patchset, thus leaving the "original" one untested. I'm not sure of what's the best approach in such situation. Maybe creating a different CF entry for each alternative, and link the other cf entry on the cf app using the "Add annotations" or "Links" feature rather than attaching threads?
Re: MDAM techniques and Index Skip Scan patch
> On Thu, Jan 13, 2022 at 03:27:08PM +, Floris Van Nee wrote: > > > > Could you send a rebased version? In the meantime I will change the status > > on the cf app to Waiting on Author. > > Attached a rebased version. FYI, I've attached this thread to the CF item as an informational one, but as there are some patches posted here, folks may get confused. For those who have landed here with no context, I feel obliged to mention that now there are two alternative patch series posted under the same CF item: * the original one lives in [1], waiting for reviews since the last May * an alternative one posted here from Floris [1]: https://www.postgresql.org/message-id/flat/20200609102247.jdlatmfyeecg52fi@localhost
Re: MDAM techniques and Index Skip Scan patch
Hi, On Sat, Oct 23, 2021 at 07:30:47PM +, Floris Van Nee wrote: > > From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series, > which focuses on the planner. v2-0001 is Dmitry's patch that extends it to > make it possible to use UniqueKeys for the skip scan. Both of these are > unfortunately still older patches, but because they are planner machinery > they are less relevant to the discussion about the executor here. Patch > v2-0002 contains most of my work and introduces all the executor logic for > the skip scan and hooks up the planner for DISTINCT queries to use the skip > scan. Patch v2-0003 is a planner hack that makes the planner pick a skip > scan on virtually every possibility. This also enables the skip scan on the > queries above that don't have a DISTINCT but where it could be useful. The patchset doesn't apply anymore: http://cfbot.cputube.org/patch_36_1741.log === Applying patches on top of PostgreSQL commit ID a18b6d2dc288dfa6e7905ede1d4462edd6a8af47 === === applying patch ./v2-0001-Extend-UniqueKeys.patch [...] patching file src/include/optimizer/paths.h Hunk #2 FAILED at 299. 1 out of 2 hunks FAILED -- saving rejects to file src/include/optimizer/paths.h.rej Could you send a rebased version? In the meantime I will change the status on the cf app to Waiting on Author.
Re: MDAM techniques and Index Skip Scan patch
Hi Peter, On 10/21/21 22:16, Peter Geoghegan wrote: I returned to the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] as part of the process of reviewing v39 of the skip scan patch, which was posted back in May. It's a great paper, and anybody involved in the skip scan effort should read it thoroughly (if they haven't already). It's easy to see why people get excited about skip scan [2]. But there is a bigger picture here. Thanks for starting this thread ! The Index Skip Scan patch could affect a lot of areas, so keeping MDAM in mind is definitely important. However, I think the key part to progress on the "core" functionality (B-tree related changes) is to get the planner functionality in place first. Hopefully we can make progress on that during the November CommitFest based on Andy's patch. Best regards, Jesper
Re: MDAM techniques and Index Skip Scan patch
> On Thu, Oct 21, 2021 at 07:16:00PM -0700, Peter Geoghegan wrote: > > My general concern is that the skip scan patch may currently be > structured in a way that paints us into a corner, MDAM-wise. > > Note that the MDAM paper treats skipping a prefix of columns as a case > where the prefix is handled by pretending that there is a clause that > looks like this: "WHERE date between -inf AND +inf" -- which is not so > different from the original sales SQL query example that I have > highlighted. We don't tend to think of queries like this (like my > sales query) as in any way related to skip-scan, because we don't > imagine that there is any skipping going on. But maybe we should > recognize the similarities. To avoid potential problems with extensibility in this sense, the implementation needs to explicitly work with sets of disjoint intervals of values instead of simple prefix size, one set of intervals per scan key. An interesting idea, doesn't seem to be a big change in terms of the patch itself.
MDAM techniques and Index Skip Scan patch
I returned to the 1995 paper "Efficient Search of Multidimensional B-Trees" [1] as part of the process of reviewing v39 of the skip scan patch, which was posted back in May. It's a great paper, and anybody involved in the skip scan effort should read it thoroughly (if they haven't already). It's easy to see why people get excited about skip scan [2]. But there is a bigger picture here. I don't necessarily expect to come away from this discussion with a much better high level architecture for the patch, or any kind of deeper insight, or even a frame of reference for further discussion. I just think that we ought to *try* to impose some order on this stuff. Like many difficult patches, the skip scan patch is not so much troubled by problems with the implementation as it is troubled by *ambiguity* about the design. Particularly concerning how skip scan meshes with existing designs, as well as future designs -- particularly designs for other MDAM techniques. I've started this thread to have a big picture conversation about how to think about these things. Many other MDAM techniques also seem highly appealing. Much of the MDAM stuff is for data warehousing use-cases, while skip scan/loose index scan is seen as more of an OLTP thing. But they are still related, clearly. I'd like to also talk about another patch, that ISTM had that same quality -- it was also held back by high level design uncertainty. Back in 2018, Tom abandoned a patch that transformed a star-schema style query with left outer joins on dimension tables with OR conditions, into an equivalent query that UNIONs together 2 distinct queries [3][4]. Believe it or not, I am now reminded of that patch by the example of "IN() Lists", from page 5 of the paper. We see this example SQL query: SELECT date, item_class, store, sum(total_sales) FROM sales WHERE date between '06/01/95' and '06/30/95' and item_class IN (20,35,50) and store IN (200,250) GROUP BY dept, date, item_class, store; Granted, this SQL might not seem directly relevant to Tom's patch at first -- there is no join for the optimizer to even try to eliminate, which was the whole basis of Jim Nasby's original complaint, which is what spurred Tom to write the patch in the first place. But hear me out: there is still a fact table (the sales table) with some dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the table (and this SQL query) drives discussion of an optimization involving transforming a predicate with many ORs (which is explicitly said to be logically/semantically equivalent to the IN() lists from the query). They transform the query into a bunch of disjunct clauses that can easily be independently executed, and combined at the end (see also "General OR Optimization" on page 6 of the paper). Also...I'm not entirely sure that the intended underlying "physical plan" is truly free of join-like scans. If you squint just right, you might see something that you could think of as a "physical join" (at least very informally). The whole point of this particular "IN() Lists" example is that we get to the following, for each distinct "dept" and "date" in the table: dept=1, date='06/04/95', item_class=20, store=200 dept=1, date='06/04/95', item_class=20, store=250 dept=1, date='06/04/95', item_class=35, store=200 dept=1, date='06/04/95', item_class=35, store=250 dept=1, date='06/04/95', item_class=50, store=200 dept=1, date='06/04/95', item_class=50, store=250 There are 2400 such accesses in total after transformation -- imagine additional lines like these, for every distinct combination of dept and date (only for those dates that actually had sales, which they enumerate up-front), for store 200 and 250, and item_class 20, 35, and 50. This adds up to 2400 lines in total. Even 2400 index probes will be much faster than a full table scan, given that this is a large fact table. The "sales" table is a clustered index whose keys are on the columns "(dept, date, item_class, store)", per note at the top of page 4. The whole point is to avoid having any secondary indexes on this fact table, without getting a full scan. We can just probe the primary key 2400 times instead, following this transformation. No need for secondary indexes. The plan can be thought of as a DAG, at least informally. This is also somewhat similar to what Tom was thinking about back in 2018. Tom had to deduplicate rows during execution (IIRC using a UNION style ad-hoc approach that sorted on TIDs), whereas I think that they can get away with skipping that extra step. Page 7 says "MDAM removes duplicates before reading the data, so it does not have to do any post read operations to accomplish duplicate elimination (a common problem with OR optimization)". My general concern is that the skip scan patch may currently be structured in a way that paints us into a corner, MDAM-wise. Note that the MDAM paper treats skipping a prefix of columns as a case where the prefix is handled by pretending that there is a
Re: Index Skip Scan (new UniqueKeys)
Hi, Here is another take on the patch with a couple of changes: * I've removed for now UniqueKeys parts. The interaction of skip scan & unique keys patch was actually not that big, so the main difference is that now the structure itself went away, a list of unique expressions is used instead. All the suggestions about how this feature should look like from the planning perspective are still there. On the one hand it will allow to develop both patches independently and avoid confusion for reviewers, on the other UniqueKeys could be easily incorporated back when needed. * Support for skipping in case of moving backward on demand (scroll cursor) is moved into a separate patch. This is implemented via returning false from IndexSupportsBackwardScan in case if it's a skip scan node, which in turn adds Materialize node on top when needed. The name SupportsBackwardScan was a bit confusing for me, but it seems it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL. Eventually those cases when BackwardScanDirection is used are still handled by amskip. This change didn't affect the test coverage, all the test cases supported in previous patch versions are still there. About Materialize node, I guess it could be less performant than a "native" support, but it simplifies the implementation significantly to the point that most parts, which were causing questions before, are now located in the isolated patch. My idea here is to concentrate efforts on the first three patches in this series, and consider the rest of them as an experiment field. * IndexScan support was also relocated into a separate patch, the first three patches are now only about IndexOnlyScan. * Last bits of reviews were incorporated and rebased. >From 074fc8a41b43dd8b10ab29eb880c9d161a1638d5 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Mon, 17 May 2021 11:47:07 +0200 Subject: [PATCH v39 1/5] Unique expressions Extend PlannerInfo and Path structures with the list of relevant unique expressions. It specifies which keys must be unique on the query level, and allows to leverage this into on the path level. At the moment only distinctClause makes use of such mechanism, which enables potential use of index skip scan. Originally proposed by David Rowley, based on the UniqueKey patch implementation from Andy Fan, contains few bits out of previous version from Jesper Pedersen, Floris Van Nee. --- src/backend/nodes/list.c| 31 + src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/pathkeys.c | 62 + src/backend/optimizer/path/uniquekeys.c | 92 + src/backend/optimizer/plan/planner.c| 36 +- src/backend/optimizer/util/pathnode.c | 32 ++--- src/include/nodes/pathnodes.h | 5 ++ src/include/nodes/pg_list.h | 1 + src/include/optimizer/pathnode.h| 1 + src/include/optimizer/paths.h | 9 +++ 10 files changed, 261 insertions(+), 11 deletions(-) create mode 100644 src/backend/optimizer/path/uniquekeys.c diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index 94fb236daf..9f2c408a4e 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1537,3 +1537,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2) return 1; return 0; } + +/* + * Return true iff every entry in "members" list is also present + * in the "target" list. + */ +bool +list_is_subset(const List *members, const List *target) +{ + const ListCell *lc1, *lc2; + + Assert(IsPointerList(members)); + Assert(IsPointerList(target)); + check_list_invariants(members); + check_list_invariants(target); + + foreach(lc1, members) + { + bool found = false; + foreach(lc2, target) + { + if (equal(lfirst(lc1), lfirst(lc2))) + { +found = true; +break; + } + } + if (!found) + return false; + } + return true; +} diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile index 1e199ff66f..7b9820c25f 100644 --- a/src/backend/optimizer/path/Makefile +++ b/src/backend/optimizer/path/Makefile @@ -21,6 +21,7 @@ OBJS = \ joinpath.o \ joinrels.o \ pathkeys.o \ - tidpath.o + tidpath.o \ + uniquekeys.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index bd9a176d7d..f28547148d 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -29,6 +29,7 @@ #include "utils/lsyscache.h" +static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys); static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys); static bool matches_boolean_partition_clause(RestrictInfo *rinfo, RelOptInfo *partrel, @@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root, return pk; } +/* +
Re: Index Skip Scan (new UniqueKeys)
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> >> >> >> >> 0002 >> >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> >> >> Just some comments/whitespace. >> >> >> 0004 >> >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > > > > > 0002 > > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > > > Just some comments/whitespace. > > > 0004 > > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jan 28, 2021 at 09:49:26PM +0900, Masahiko Sawada wrote: > Hi Dmitry, > > Status update for a commitfest entry. > > This patch entry has been "Waiting on Author" on CF app and the > discussion seems inactive from the last CF. Could you share the > current status of this patch? Heikki already sent review comments and > there was a discussion but the WoA status is correct? If it needs > reviews, please rebase the patches and set it to "Needs Reviews" on CF > app. If you're not working on this, I'm going to set it to "Returned > with Feedback", barring objections. Yes, I'm still on it. In fact, I've sketched up almost immediately couple of changes to address Heikki feedback, but was distracted by subscripting stuff. Will try to send new version of the patch soon.
Re: Index Skip Scan (new UniqueKeys)
Hi Dmitry, On Sun, Oct 25, 2020 at 1:45 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote: > > > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote: > > > > > > * I see the following compiler warning: > > > > > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c: > > > In function ‘populate_baserel_uniquekeys’: > > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13: > > > warning: ‘expr’ may be used uninitialized in this function > > > [-Wmaybe-uninitialized] > > > 797 | else if (!list_member(unique_index->rel->reltarget->exprs, > > > expr)) > > > | ^~ > > > > This is mostly for UniqueKeys patch, which is attached here only as a > > dependency, but I'll prepare changes for that. Interesting enough I > > can't reproduce this warning, but if I understand correctly gcc has some > > history of spurious uninitialized warnings, so I guess it could be > > version dependent. > > > > > * Perhaps the warning is related to this nearby code that I noticed > > > Valgrind complains about: > > > > > > ==1083468== VALGRINDERROR-BEGIN > > > ==1083468== Invalid read of size 4 > > > ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771) > > > ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140) > > > > This also belongs to UniqueKeys patch, but at least I can reproduce this > > one. My guess is that nkeycolums should be used there, not ncolums, > > which is visible in index_incuding tests. The same as previous one, will > > prepare corresponding changes. > > > > > * Do we really need the AM-level boolean flag/argument named > > > "scanstart"? Why not just follow the example of btgettuple(), which > > > determines whether or not the scan has been initialized based on the > > > current scan position? > > > > > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you > > > cannot or should not take the same approach as btgettuple(). And even > > > if you can't take exactly the same approach, I would still think that > > > the scan's opaque B-Tree state should remember if it's the first call > > > to _bt_skip() (rather than some subsequent call) in some other way > > > (e.g. carrying a "scanstart" bool flag directly). > > > > Yes, agree, carrying this flag inside the opaque state would be better. > > Here is a new version which doesn't require "scanstart" argument and > contains few other changes to address the issues mentioned earlier. It's > also based on the latest UniqueKeys patches with the valgrind issue > fixed (as before they're attached also just for the references, you can > find more in the original thread). I didn't rework commentaries yet, > will post it soon (need to get an inspiration first, probably via > reading Shakespeare unless someone has better suggestions). > > > > * Why is it okay to do anything important based on the > > > _bt_scankey_within_page() return value? > > > > > > If the page is empty, then how can we know that it's okay to go to the > > > next value? I'm concerned that there could be subtle bugs in this > > > area. VACUUM will usually just delete the empty page. But it won't > > > always do so, for a variety of reasons that aren't worth going into > > > now. This could mask bugs in this area. I'm concerned about patterns > > > like this one from _bt_skip(): > > > > > > while (!nextFound) > > > { > > > > > > > > > if (_bt_scankey_within_page(scan, so->skipScanKey, > > > so->currPos.buf, dir)) > > > { > > > ... > > > } > > > else > > > /* > > > * If startItup could be not found within the current > > > page, > > > * assume we found something new > > > */ > > > nextFound = true; > > > > > > } > > > > > > Why would you assume that "we found something new" here? In general I > > > just don't understand the design of _bt_skip(). I get the basic idea > > > of what you're trying to do, but it could really use better comments. > > > > Yeah, I'll put more efforts into clear comments. There are two different > > ways in which _bt_scankey_within_page is being used. > > > > The first one is to check if it's possible to skip traversal of the tree > > from root in case if what we're looking for could be on the current > > page. In this case an empty page would mean we need to search from the > > root, so not sure what could be the issue here? > > > > The second one (that you've highlighted above) I admit is probably the > > most questionable part of the patch and open for suggestions how to > > improve it. It's required for one
Re: Index Skip Scan (new UniqueKeys)
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas wrote: > On 01/12/2020 22:21, Dmitry Dolgov wrote: > >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > >> - I'm surprised you need a new index AM function (amskip) for this. Can't > >> you just restart the scan with index_rescan()? The btree AM can check if > >> the > >> new keys are on the same page, and optimize the rescan accordingly, like > >> amskip does. That would speed up e.g. nested loop scans too, where the keys > >> just happen to be clustered. > > > > An interesting point. At the moment I'm not sure whether it's possible > > to implement skipping via index_rescan or not, need to take a look. But > > checking if the new keys are on the same page would introduce some > > overhead I guess, wouldn't it be too invasive to add it into already > > existing btree AM? > > I think it'll be OK. But if it's not, you could add a hint argument to > index_rescan() to hint the index AM that the new key is known to be > greater than the previous key. FWIW here's what I wrote about that years ago[1]: > It works by adding a new index operation 'skip' which the executor > code can use during a scan to advance to the next value (for some > prefix of the index's columns). That may be a terrible idea and > totally unnecessary... but let me explain my > reasoning: > > 1. Perhaps some index implementations can do something better than a > search for the next key value from the root. Is it possible or > desirable to use the current position as a starting point for a btree > traversal? I don't know. > > 2. It seemed that I'd need to create a new search ScanKey to use the > 'rescan' interface for skipping to the next value, but I already had > an insertion ScanKey so I wanted a way to just reuse that. But maybe > there is some other way to reuse existing index interfaces, or maybe > there is an easy way to make a new search ScanKey from the existing >insertion ScanKey? [1] https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Dec 01, 2020 at 10:59:22PM +0200, Heikki Linnakangas wrote: > > > > - Does this optimization apply to bitmap index scans? > > > > No, from what I understand it doesn't. > > Would it be hard to add? Don't need to solve everything in the first > version of this, but I think in principle you could do the same > optimization for bitmap index scans, so if the current API can't do it, > that's maybe an indication that the API isn't quite right. I would expect it should not be hard as at the moment all parts seems relatively generic. But of course I need to check, while it seems no one had bitmap index scans in mind while developing this patch. > > > I'm actually a bit confused why we need this condition. The IndexScan > > > executor node should call amskip() only after checking the additional > > > quals, > > > no? > > > > This part I don't quite get, what exactly you mean by checking the > > additional quals in the executor node? But at the end of the day this > > condition was implemented exactly to address the described issue, which > > was found later and added to the tests. > > As I understand this, the executor logic goes like this: > > query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a > like 'a%' and b = 'b'; > > 1. Call index_beginscan, keys: a >= 'a', b = 'b' > > 2. Call index_getnext, which returns first row to the Index Scan node > > 3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto step > 2 to get next tuple. > > 4. Return tuple to parent node > > 5. index_amskip(), to the next tuple with a > 'a'. Goto 2. > > The logic should work fine, even if there are quals that are not indexable, > like "c like '%y'" in the above example. So why doesn't it work? What am I > missing? To remind myself how it works I went through this sequence, and from what I understand the qual "c like '%y%'" is evaluated in this case in ExecQual, not after index_getnext_tid (and values returned after skipping are reported as filtered out). So when it comes to index_skip only quals on a & b were evaluated. Or did you mean something else? Another small detail is that in the current implementation there is no goto 2 in the last step. Originally it was like that, but since skipping return an exact position that we need there was something like "find a value, then do one step back so that index_getnext will find it". Unfortunately this stepping back part turns out to be a source of troubles, and getting rid of it even allowed to make code somewhat more concise. But of course I'm open for suggestions about improvements.
Re: Index Skip Scan (new UniqueKeys)
On 01/12/2020 22:21, Dmitry Dolgov wrote: On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: I had a quick look at this patch. I haven't been following this thread, so sorry if I'm repeating old arguments, but here we go: Thanks! - I'm surprised you need a new index AM function (amskip) for this. Can't you just restart the scan with index_rescan()? The btree AM can check if the new keys are on the same page, and optimize the rescan accordingly, like amskip does. That would speed up e.g. nested loop scans too, where the keys just happen to be clustered. An interesting point. At the moment I'm not sure whether it's possible to implement skipping via index_rescan or not, need to take a look. But checking if the new keys are on the same page would introduce some overhead I guess, wouldn't it be too invasive to add it into already existing btree AM? I think it'll be OK. But if it's not, you could add a hint argument to index_rescan() to hint the index AM that the new key is known to be greater than the previous key. - Does this optimization apply to bitmap index scans? No, from what I understand it doesn't. Would it be hard to add? Don't need to solve everything in the first version of this, but I think in principle you could do the same optimization for bitmap index scans, so if the current API can't do it, that's maybe an indication that the API isn't quite right. - This logic in build_index_paths() is not correct: + /* +* Skip scan is not supported when there are qual conditions, which are not +* covered by index. The reason for that is that those conditions are +* evaluated later, already after skipping was applied. +* +* TODO: This implementation is too restrictive, and doesn't allow e.g. +* index expressions. For that we need to examine index_clauses too. +*/ + if (root->parse->jointree != NULL) + { + ListCell *lc; + + foreach(lc, (List *)root->parse->jointree->quals) + { + Node *expr, *qual = (Node *) lfirst(lc); + Var *var; + bool found = false; + + if (!is_opclause(qual)) + { + not_empty_qual = true; + break; + } + + expr = get_leftop(qual); + + if (!IsA(expr, Var)) + { + not_empty_qual = true; + break; + } + + var = (Var *) expr; + + for (int i = 0; i < index->ncolumns; i++) + { + if (index->indexkeys[i] == var->varattno) + { + found = true; + break; + } + } + + if (!found) + { + not_empty_qual = true; + break; + } + } + } If you care whether the qual is evaluated by the index AM or not, you need to also check that the operator is indexable. Attached is a query that demonstrates that problem. ... Also, you should probably check that the index quals are in the operator family as that used for the DISTINCT. Yes, good point, will change this in the next version. I'm actually a bit confused why we need this condition. The IndexScan executor node should call amskip() only after checking the additional quals, no? This part I don't quite get, what exactly you mean by checking the additional quals in the executor node? But at the end of the day this condition was implemented exactly to address the described issue, which was found later and added to the tests. As I understand this, the executor logic goes like this: query: SELECT DISTINCT ON (a, b) a, b FROM foo where c like '%y%' and a like 'a%' and b = 'b'; 1. Call index_beginscan, keys: a >= 'a', b = 'b' 2. Call index_getnext, which returns first row to the Index Scan node 3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto step 2 to get next tuple. 4. Return tuple to parent node 5. index_amskip(), to the next tuple with a > 'a'. Goto 2. The logic should work fine, even if there are quals that are not indexable, like "c like '%y'" in the above example. So why doesn't it work? What am I
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote: > > I had a quick look at this patch. I haven't been following this thread, so > sorry if I'm repeating old arguments, but here we go: Thanks! > - I'm surprised you need a new index AM function (amskip) for this. Can't > you just restart the scan with index_rescan()? The btree AM can check if the > new keys are on the same page, and optimize the rescan accordingly, like > amskip does. That would speed up e.g. nested loop scans too, where the keys > just happen to be clustered. An interesting point. At the moment I'm not sure whether it's possible to implement skipping via index_rescan or not, need to take a look. But checking if the new keys are on the same page would introduce some overhead I guess, wouldn't it be too invasive to add it into already existing btree AM? > - Does this optimization apply to bitmap index scans? No, from what I understand it doesn't. > - This logic in build_index_paths() is not correct: > > > + /* > > +* Skip scan is not supported when there are qual conditions, > > which are not > > +* covered by index. The reason for that is that those > > conditions are > > +* evaluated later, already after skipping was applied. > > +* > > +* TODO: This implementation is too restrictive, and doesn't > > allow e.g. > > +* index expressions. For that we need to examine index_clauses > > too. > > +*/ > > + if (root->parse->jointree != NULL) > > + { > > + ListCell *lc; > > + > > + foreach(lc, (List *)root->parse->jointree->quals) > > + { > > + Node *expr, *qual = (Node *) lfirst(lc); > > + Var *var; > > + bool found = false; > > + > > + if (!is_opclause(qual)) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + > > + expr = get_leftop(qual); > > + > > + if (!IsA(expr, Var)) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + > > + var = (Var *) expr; > > + > > + for (int i = 0; i < index->ncolumns; i++) > > + { > > + if (index->indexkeys[i] == > > var->varattno) > > + { > > + found = true; > > + break; > > + } > > + } > > + > > + if (!found) > > + { > > + not_empty_qual = true; > > + break; > > + } > > + } > > + } > > If you care whether the qual is evaluated by the index AM or not, you need > to also check that the operator is indexable. Attached is a query that > demonstrates that problem. > ... > Also, you should probably check that the index quals are in the operator > family as that used for the DISTINCT. Yes, good point, will change this in the next version. > I'm actually a bit confused why we need this condition. The IndexScan > executor node should call amskip() only after checking the additional quals, > no? This part I don't quite get, what exactly you mean by checking the additional quals in the executor node? But at the end of the day this condition was implemented exactly to address the described issue, which was found later and added to the tests.
Re: Index Skip Scan (new UniqueKeys)
On 24/10/2020 19:45, Dmitry Dolgov wrote: Here is a new version which doesn't require "scanstart" argument and contains few other changes to address the issues mentioned earlier. It's also based on the latest UniqueKeys patches with the valgrind issue fixed (as before they're attached also just for the references, you can find more in the original thread). I didn't rework commentaries yet, will post it soon (need to get an inspiration first, probably via reading Shakespeare unless someone has better suggestions). I had a quick look at this patch. I haven't been following this thread, so sorry if I'm repeating old arguments, but here we go: - I'm surprised you need a new index AM function (amskip) for this. Can't you just restart the scan with index_rescan()? The btree AM can check if the new keys are on the same page, and optimize the rescan accordingly, like amskip does. That would speed up e.g. nested loop scans too, where the keys just happen to be clustered. - Does this optimization apply to bitmap index scans? - This logic in build_index_paths() is not correct: + /* +* Skip scan is not supported when there are qual conditions, which are not +* covered by index. The reason for that is that those conditions are +* evaluated later, already after skipping was applied. +* +* TODO: This implementation is too restrictive, and doesn't allow e.g. +* index expressions. For that we need to examine index_clauses too. +*/ + if (root->parse->jointree != NULL) + { + ListCell *lc; + + foreach(lc, (List *)root->parse->jointree->quals) + { + Node *expr, *qual = (Node *) lfirst(lc); + Var *var; + bool found = false; + + if (!is_opclause(qual)) + { + not_empty_qual = true; + break; + } + + expr = get_leftop(qual); + + if (!IsA(expr, Var)) + { + not_empty_qual = true; + break; + } + + var = (Var *) expr; + + for (int i = 0; i < index->ncolumns; i++) + { + if (index->indexkeys[i] == var->varattno) + { + found = true; + break; + } + } + + if (!found) + { + not_empty_qual = true; + break; + } + } + } If you care whether the qual is evaluated by the index AM or not, you need to also check that the operator is indexable. Attached is a query that demonstrates that problem. I'm actually a bit confused why we need this condition. The IndexScan executor node should call amskip() only after checking the additional quals, no? Also, you should probably check that the index quals are in the operator family as that used for the DISTINCT. - Heikki index-skipscan-bug.sql Description: application/sql
Re: Index Skip Scan (new UniqueKeys)
> On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote: > > * I see the following compiler warning: > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c: > In function ‘populate_baserel_uniquekeys’: > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13: > warning: ‘expr’ may be used uninitialized in this function > [-Wmaybe-uninitialized] > 797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr)) > | ^~ This is mostly for UniqueKeys patch, which is attached here only as a dependency, but I'll prepare changes for that. Interesting enough I can't reproduce this warning, but if I understand correctly gcc has some history of spurious uninitialized warnings, so I guess it could be version dependent. > * Perhaps the warning is related to this nearby code that I noticed > Valgrind complains about: > > ==1083468== VALGRINDERROR-BEGIN > ==1083468== Invalid read of size 4 > ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771) > ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140) This also belongs to UniqueKeys patch, but at least I can reproduce this one. My guess is that nkeycolums should be used there, not ncolums, which is visible in index_incuding tests. The same as previous one, will prepare corresponding changes. > * Do we really need the AM-level boolean flag/argument named > "scanstart"? Why not just follow the example of btgettuple(), which > determines whether or not the scan has been initialized based on the > current scan position? > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you > cannot or should not take the same approach as btgettuple(). And even > if you can't take exactly the same approach, I would still think that > the scan's opaque B-Tree state should remember if it's the first call > to _bt_skip() (rather than some subsequent call) in some other way > (e.g. carrying a "scanstart" bool flag directly). Yes, agree, carrying this flag inside the opaque state would be better. > * Why is it okay to do anything important based on the > _bt_scankey_within_page() return value? > > If the page is empty, then how can we know that it's okay to go to the > next value? I'm concerned that there could be subtle bugs in this > area. VACUUM will usually just delete the empty page. But it won't > always do so, for a variety of reasons that aren't worth going into > now. This could mask bugs in this area. I'm concerned about patterns > like this one from _bt_skip(): > > while (!nextFound) > { > > > if (_bt_scankey_within_page(scan, so->skipScanKey, > so->currPos.buf, dir)) > { > ... > } > else > /* > * If startItup could be not found within the current > page, > * assume we found something new > */ > nextFound = true; > > } > > Why would you assume that "we found something new" here? In general I > just don't understand the design of _bt_skip(). I get the basic idea > of what you're trying to do, but it could really use better comments. Yeah, I'll put more efforts into clear comments. There are two different ways in which _bt_scankey_within_page is being used. The first one is to check if it's possible to skip traversal of the tree from root in case if what we're looking for could be on the current page. In this case an empty page would mean we need to search from the root, so not sure what could be the issue here? The second one (that you've highlighted above) I admit is probably the most questionable part of the patch and open for suggestions how to improve it. It's required for one particular case with a cursor when scan advances forward but reads backward. What could happen here is we found one valid item, but the next one e.g. do not pass scan key conditions, and we end up with the previous item again. I'm not entirely sure how presence of an empty page could change this scenario, could you please show an example? > *The "jump one more time if it's the same as at the beginning" thing > seems scary to me. Maybe you should be doing something with the actual > high key here. Same as for the previous question, can you give a hint what do you mean by "doing something with the actual high key"?
Re: Index Skip Scan (new UniqueKeys)
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan wrote: > That's all I have for now. One more thing. I don't think that this should be a bitwise AND: if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE)) { } -- Peter Geoghegan
Re: Index Skip Scan (new UniqueKeys)
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is a new version that hopefully address most of the concerns > mentioned in this thread so far. As before, first two patches are taken > from UniqueKeys thread and attached only for the reference. List of > changes includes: Some thoughts on this version of the patch series (I'm focussing on v36-0005-Btree-implementation-of-skipping.patch again): * I see the following compiler warning: /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c: In function ‘populate_baserel_uniquekeys’: /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13: warning: ‘expr’ may be used uninitialized in this function [-Wmaybe-uninitialized] 797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr)) | ^~ * Perhaps the warning is related to this nearby code that I noticed Valgrind complains about: ==1083468== VALGRINDERROR-BEGIN ==1083468== Invalid read of size 4 ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771) ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140) ==1083468==by 0x56AEA5: set_plain_rel_size (allpaths.c:586) ==1083468==by 0x56AADB: set_rel_size (allpaths.c:412) ==1083468==by 0x56A8CD: set_base_rel_sizes (allpaths.c:323) ==1083468==by 0x56A5A7: make_one_rel (allpaths.c:185) ==1083468==by 0x5AB426: query_planner (planmain.c:269) ==1083468==by 0x5AF02C: grouping_planner (planner.c:2058) ==1083468==by 0x5AD202: subquery_planner (planner.c:1015) ==1083468==by 0x5ABABF: standard_planner (planner.c:405) ==1083468==by 0x5AB7F8: planner (planner.c:275) ==1083468==by 0x6E6F84: pg_plan_query (postgres.c:875) ==1083468==by 0x6E70C4: pg_plan_queries (postgres.c:966) ==1083468==by 0x6E7497: exec_simple_query (postgres.c:1158) ==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309) ==1083468==by 0x624284: BackendRun (postmaster.c:4541) ==1083468==by 0x623995: BackendStartup (postmaster.c:4225) ==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742) ==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415) ==1083468==by 0x514AF2: main (main.c:209) ==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size 8,192 alloc'd ==1083468==at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==1083468==by 0x8C15C8: AllocSetAlloc (aset.c:919) ==1083468==by 0x8CEA52: palloc (mcxt.c:964) ==1083468==by 0x267F25: systable_beginscan (genam.c:373) ==1083468==by 0x8682CE: SearchCatCacheMiss (catcache.c:1359) ==1083468==by 0x868167: SearchCatCacheInternal (catcache.c:1299) ==1083468==by 0x867E2C: SearchCatCache1 (catcache.c:1167) ==1083468==by 0x8860B2: SearchSysCache1 (syscache.c:1123) ==1083468==by 0x8BD482: check_enable_rls (rls.c:66) ==1083468==by 0x68A113: get_row_security_policies (rowsecurity.c:134) ==1083468==by 0x683C2C: fireRIRrules (rewriteHandler.c:2045) ==1083468==by 0x687340: QueryRewrite (rewriteHandler.c:3962) ==1083468==by 0x6E6EB1: pg_rewrite_query (postgres.c:784) ==1083468==by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700) ==1083468==by 0x6E7476: exec_simple_query (postgres.c:1155) ==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309) ==1083468==by 0x624284: BackendRun (postmaster.c:4541) ==1083468==by 0x623995: BackendStartup (postmaster.c:4225) ==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742) ==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415) ==1083468== ==1083468== VALGRINDERROR-END (You'll see the same error if you run Postgres Valgrind + "make installcheck", though I don't think that the queries in question are tests that you yourself wrote.) * IndexScanDescData.xs_itup comments could stand to be updated here -- IndexScanDescData.xs_want_itup is no longer just about index-only scans. * Do we really need the AM-level boolean flag/argument named "scanstart"? Why not just follow the example of btgettuple(), which determines whether or not the scan has been initialized based on the current scan position? Just because you set so->currPos.buf to InvalidBuffer doesn't mean you cannot or should not take the same approach as btgettuple(). And even if you can't take exactly the same approach, I would still think that the scan's opaque B-Tree state should remember if it's the first call to _bt_skip() (rather than some subsequent call) in some other way (e.g. carrying a "scanstart" bool flag directly). A part of my objection to "scanstart" is that it seems to require that much of the code within _bt_skip() get another level of indentation...which makes it even more difficult to follow. * I don't understand what _bt_scankey_within_page() comments mean when they refer to "the page highkey". It looks like this function examines the highest data
Re: Index Skip Scan (new UniqueKeys)
Hi David: Thanks for looking into this. On Fri, Jul 31, 2020 at 11:07 AM David Rowley wrote: > On Mon, 13 Jul 2020 at 10:18, Floris Van Nee > wrote: > > One question about the unique keys - probably for Andy or David: I've > looked in the archives to find arguments for/against using Expr nodes or > EquivalenceClasses in the Unique Keys patch. However, I couldn't really > find a clear answer about why the current patch uses Expr rather than > EquivalenceClasses. At some point David mentioned "that probably Expr nodes > were needed rather than EquivalenceClasses", but it's not really clear to > me why. What were the thoughts behind this? > > I'm still not quite sure on this either way. I did think > EquivalenceClasses were more suitable before I wrote the POC patch for > unique keys. But after that, I had in mind that Exprs might be > better. The reason I thought this was due to the fact that the > DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were > EquivalenceClasses then checking to see if the DISTINCT can be skipped > turned into something more complex that required looking through lists > of ec_members rather than just checking if the uniquekey exprs were a > subset of the DISTINCT clause. > > Thinking about it a bit harder, if we did use Exprs then it would mean > it a case like the following wouldn't work for Andy's DISTINCT no-op > stuff. > > CREATE TABLE xy (x int primary key, y int not null); > > SELECT DISTINCT y FROM xy WHERE x=y; > > whereas if we use EquivalenceClasses then we'll find that we have an > EC with x,y in it and can skip the DISTINCT since we have a UniqueKey > containing that EquivalenceClass. > Also, looking at what Andy wrote to make a case like the following > work in his populate_baserel_uniquekeys() function in the 0002 patch: > > CREATE TABLE ab (a int, b int, primary key(a,b)); > SELECT DISTINCT a FROM ab WHERE b = 1; > > it's a bit uninspiring. Really what we want here when checking if we > can skip doing the DISTINCT is a UniqueKey set using > EquivalenceClasses as we can just insist that any unmatched UniqueKey > items have an ec_is_const == true. However, that means we have to loop > through the ec_members of the EquivalenceClasses in the uniquekeys > during the DISTINCT check. That's particularly bad when you consider > that in a partitioned table case there might be an ec_member for each > child partition and there could be 1000s of child partitions and > following those ec_members chains is going to be too slow. > > My current thoughts are that we should be using EquivalenceClasses but > we should first add some infrastructure to make them perform better. > My current thoughts are that we do something like what I mentioned in > [1] or something more like what Andres mentions in [2]. After that, > we could either make EquivalenceClass.ec_members a hash table or > binary search tree. Or even perhaps just have a single hash table/BST > for all EquivalenceClasses that allows very fast lookups from {Expr} > -> {EquivalenceClass}. I think an Expr can only belong in a single > non-merged EquivalenceClass. So when we do merging of > EquivalenceClasses we could just repoint that data structure to point > to the new EquivalenceClass. We'd never point to ones that have > ec_merged != NULL. This would also allow us to fix the poor > performance in regards to get_eclass_for_sort_expr() for partitioned > tables. > > So, it seems the patch dependency chain for skip scans just got a bit > longer :-( > > I admit that EquivalenceClasses has a better expressive power. There are 2 more cases we can handle better with EquivalenceClasses. SELECT DISTINCT a, b, c FROM t WHERE a = b; Currently the UniqueKey is (a, b, c), but it is better be (a, c) and (b, c). The other case happens similarly in group by case. After realizing this, I am still hesitant to do that, due to the complexity. If we do that, we may have to maintain a EquivalenceClasses in one more place or make the existing EquivalenceClasses List longer, for example: SELECT pk FROM t; The current infrastructure doesn't create any EquivalenceClasses for pk. So we have to create a new one in this case and reuse some existing ones in other cases. Finally since the EquivalenceClasses is not so straight to upper user, we have to depend on the infrastructure change to look up an EquivalenceClasses quickly from an Expr. I rethink more about the case you provide above, IIUC, there is such issue for joinrel. then we can just add a EC checking for populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we should build the UniqueKeys from root->distinct_pathkeys and root->group_pathkeys where the EquivalenceClasses are already there. I am still not insisting on either Expr or EquivalenceClasses right now, if we need to change it to EquivalenceClasses, I'd see if we need to have more places to take care before doing that. -- Best Regards Andy Fan
Re: Index Skip Scan (new UniqueKeys)
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee wrote: > One question about the unique keys - probably for Andy or David: I've looked > in the archives to find arguments for/against using Expr nodes or > EquivalenceClasses in the Unique Keys patch. However, I couldn't really find > a clear answer about why the current patch uses Expr rather than > EquivalenceClasses. At some point David mentioned "that probably Expr nodes > were needed rather than EquivalenceClasses", but it's not really clear to me > why. What were the thoughts behind this? I'm still not quite sure on this either way. I did think EquivalenceClasses were more suitable before I wrote the POC patch for unique keys. But after that, I had in mind that Exprs might be better. The reason I thought this was due to the fact that the DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were EquivalenceClasses then checking to see if the DISTINCT can be skipped turned into something more complex that required looking through lists of ec_members rather than just checking if the uniquekey exprs were a subset of the DISTINCT clause. Thinking about it a bit harder, if we did use Exprs then it would mean it a case like the following wouldn't work for Andy's DISTINCT no-op stuff. CREATE TABLE xy (x int primary key, y int not null); SELECT DISTINCT y FROM xy WHERE x=y; whereas if we use EquivalenceClasses then we'll find that we have an EC with x,y in it and can skip the DISTINCT since we have a UniqueKey containing that EquivalenceClass. Also, looking at what Andy wrote to make a case like the following work in his populate_baserel_uniquekeys() function in the 0002 patch: CREATE TABLE ab (a int, b int, primary key(a,b)); SELECT DISTINCT a FROM ab WHERE b = 1; it's a bit uninspiring. Really what we want here when checking if we can skip doing the DISTINCT is a UniqueKey set using EquivalenceClasses as we can just insist that any unmatched UniqueKey items have an ec_is_const == true. However, that means we have to loop through the ec_members of the EquivalenceClasses in the uniquekeys during the DISTINCT check. That's particularly bad when you consider that in a partitioned table case there might be an ec_member for each child partition and there could be 1000s of child partitions and following those ec_members chains is going to be too slow. My current thoughts are that we should be using EquivalenceClasses but we should first add some infrastructure to make them perform better. My current thoughts are that we do something like what I mentioned in [1] or something more like what Andres mentions in [2]. After that, we could either make EquivalenceClass.ec_members a hash table or binary search tree. Or even perhaps just have a single hash table/BST for all EquivalenceClasses that allows very fast lookups from {Expr} -> {EquivalenceClass}. I think an Expr can only belong in a single non-merged EquivalenceClass. So when we do merging of EquivalenceClasses we could just repoint that data structure to point to the new EquivalenceClass. We'd never point to ones that have ec_merged != NULL. This would also allow us to fix the poor performance in regards to get_eclass_for_sort_expr() for partitioned tables. So, it seems the patch dependency chain for skip scans just got a bit longer :-( David [1] https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote: > > > Well, it's obviously wrong, thanks for noticing. What is necessary is to > > compare two index tuples, the start and the next one, to test if they're > > the same (in which case if I'm not mistaken probably we can compare item > > pointers). I've got this question when I was about to post a new version > > with changes to address feedback from Andy, now I'll combine them and > > send a cumulative patch. > > This sounds like approximately the same problem as the one that > _bt_killitems() has to deal with as of Postgres 13. This is handled in > a way that is admittedly pretty tricky, even though the code does not > need to be 100% certain that it's "the same" tuple. Deduplication kind > of makes that a fuzzy concept. In principle there could be one big > index tuple instead of 5 tuples, even though the logical contents of > the page have not been changed between the time we recording heap TIDs > in local and the time _bt_killitems() tried to match on those heap > TIDs to kill_prior_tuple-kill some index tuples -- a concurrent > deduplication pass could do that. Your code needs to be prepared for > stuff like that. > > Ultimately posting list tuples are just a matter of understanding the > on-disk representation -- a "Small Matter of Programming". Even > without deduplication there are potential hazards from the physical > deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is > not code that runs in VACUUM, despite the name). Make sure that you > hold a buffer pin on the leaf page throughout, because you need to do > that to make sure that VACUUM cannot concurrently recycle heap TIDs. > If VACUUM *is* able to concurrently recycle heap TIDs then it'll be > subtly broken. _bt_killitems() is safe because it either holds on to a > pin or gives up when the LSN changes at all. (ISTM that your only > choice is to hold on to a leaf page pin, since you cannot just decide > to give up in the way that _bt_killitems() sometimes can.) I see, thanks for clarification. You're right, in this part of implementation there is no way to give up if LSN changes like _bt_killitems does. As far as I can see the leaf page is already pinned all the time between reading relevant tuples and comparing them, I only need to handle posting list tuples.
RE: Index Skip Scan (new UniqueKeys)
> > One UniqueKey can have multiple corresponding expressions, which gives us > also possibility of having one unique key with (t1.a, t2.a) and it looks now > similar to EquivalenceClass. > I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a). > > The idea behind this query sounds questionable to me, more transparent > would be to do this without distinct, skipping will actually do exactly the > same > stuff just under another name. But if allowing skipping on constants do not > bring significant changes in the code probably it's fine. > Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write as a regular SELECT with LIMIT 1 of course. However, it's more to be consistent and predictable to the user: if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And to support it also makes the implementation more consistent with little code changes. > > > > Yeah, there's definitely some double work there, but the actual impact may > be limited - it doesn't actually allocate a new path key, but it looks it up > in > root->canon_pathkeys and returns that path key. > > I wrote it like this, because I couldn't find a way to identify from a > > certain > PathKey the actual location in the index of that column. The constructed path > keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes > path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But > how to get from PathKey=b to that number 3, I didn't find a solid way except > doing this. Maybe there is though? > > I don't think there is a direct way, but why not modify build_index_paths to > also provide this information, or compare index_pathkeys expressions with > indextlist without actually create those pathkeys again? > I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later. > And couple of words about this thread [1]. It looks to me like a strange way > of interacting with the community. Are you going to duplicate there > everything, or what are your plans? At the very least you could try to include > everyone involved in the recipients list, not exclude some of the authors. > When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone. In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches. [1] https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost
Re: Index Skip Scan (new UniqueKeys)
> On Tue, Jul 14, 2020 at 06:18:50PM +, Floris Van Nee wrote: > > Due to the other changes I made in > create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan > now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to > give an actual failing test case now. However, since all code filters out > constant keys, I think uniqueness should do it too - otherwise you could get > into problems later on. It's also more consistent. If you already know > something is unique by just (b), it doesn't make sense to store that it's > unique by (a,b). Now that I think of it, the best place to do this > EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, > rather than build_uniquekeys though. It's probably good to move it there. That would be my suggestion as well. > > Along the lines I'm also curious about this part: > > > > - ListCell *k; > > - List *exprs = NIL; > > - > > - foreach(k, ec->ec_members) > > - { > > - EquivalenceMember *mem = (EquivalenceMember *) > > lfirst(k); > > - exprs = lappend(exprs, mem->em_expr); > > - } > > - > > - result = lappend(result, makeUniqueKey(exprs, false, false)); > > + EquivalenceMember *mem = (EquivalenceMember*) > > +lfirst(list_head(ec->ec_members)); > > > > I'm curious about this myself, maybe someone can clarify. It looks like > > generaly speaking there could be more than one member (if not > > ec_has_volatile), which "representing knowledge that multiple items are > > effectively equal". Is this information is not interesting enough to > > preserve it > > in unique keys? > > Yeah, that's a good question. Hence my question about the choice for Expr > rather than EquivalenceClass for the Unique Keys patch to Andy/David. When > storing just Expr, it is rather difficult to check equivalence in joins for > example. Suppose, later on we decide to add join support to the distinct skip > scan. Consider a query like this: > SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a > As far as my understanding goes (I didn't look into it in detail though), I > think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That > results in a UniqueKey with expr (t1.a) (because currently we only take the > first Expr in the list to construct the UniqueKey). We could also construct > *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't > think that's the correct approach either, as it will explode when you have > multiple pathkeys, each having multiple Expr inside their EqClasses. One UniqueKey can have multiple corresponding expressions, which gives us also possibility of having one unique key with (t1.a, t2.a) and it looks now similar to EquivalenceClass. > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > > skipping. But it wouldn't create the uniquekeys in this case. This makes the > > planner not choose skip scans even though it could. For example in queries > > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > > is constant, it's eliminated from regular pathkeys. > > > > What would be the point of skipping if it's a constant? > > For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; > There may be 1000s of records with a=1. We're only interested in the first > one though. The traditional non-skip approach would still scan all records > with a=1. Skip would just fetch the first one with a=1 and then skip to the > next prefix and stop the scan. The idea behind this query sounds questionable to me, more transparent would be to do this without distinct, skipping will actually do exactly the same stuff just under another name. But if allowing skipping on constants do not bring significant changes in the code probably it's fine. > > > - to combat the issues mentioned earlier, there's now a check in > > build_index_paths that checks if the query_pathkeys matches the > > useful_pathkeys. Note that we have to use the path keys here rather than > > any of the unique keys. The unique keys are only Expr nodes - they do not > > contain the necessary information about ordering. Due to elimination of > > some constant path keys, we have to search the attributes of the index to > > find the correct prefix to use in skipping. > > > > IIUC here you mean this function, right? > > > > + prefix = find_index_prefix_for_pathkey(root, > > + > > index, > > + > > BackwardScanDirection, > > + > > llast_node(PathKey, > > + > > root->distinct_pathkeys)); > > > > Doesn't it duplicate the job already done in build_index_pathkeys by > > building > > those pathkeys again? If yes, probably it's possible to reuse > > useful_pathkeys. > > Not sure about unordered indexes, but looks like query_pathkeys should > > also match in this case. > > > > Yeah, there's definitely some double work there, but the actual impact may be > limited - it doesn't actually allocate a new path key, but it looks
Re: Index Skip Scan (new UniqueKeys)
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > + currItem = >currPos.items[so->currPos.lastItem]; > > > + itup = (IndexTuple) (so->currTuples + > > > currItem->tupleOffset); > > > + nextOffset = ItemPointerGetOffsetNumber(>t_tid); > > Do you mean this last part with t_tid, which could also have a tid array > in case of posting tuple format? Yeah. There is a TID array at the end of the index when the tuple is a posting list tuple (as indicated by BTreeTupleIsPivot()). It isn't safe to assume that t_tid is a heap TID for this reason, even in code that only ever considers data items (that is, non high-key tuples AKA non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot be posting list tuples either, so you'll see the assumption that t_tid is just a heap TID in parts of nbtinsert.c -- though only for the new/incoming item.) > Well, it's obviously wrong, thanks for noticing. What is necessary is to > compare two index tuples, the start and the next one, to test if they're > the same (in which case if I'm not mistaken probably we can compare item > pointers). I've got this question when I was about to post a new version > with changes to address feedback from Andy, now I'll combine them and > send a cumulative patch. This sounds like approximately the same problem as the one that _bt_killitems() has to deal with as of Postgres 13. This is handled in a way that is admittedly pretty tricky, even though the code does not need to be 100% certain that it's "the same" tuple. Deduplication kind of makes that a fuzzy concept. In principle there could be one big index tuple instead of 5 tuples, even though the logical contents of the page have not been changed between the time we recording heap TIDs in local and the time _bt_killitems() tried to match on those heap TIDs to kill_prior_tuple-kill some index tuples -- a concurrent deduplication pass could do that. Your code needs to be prepared for stuff like that. Ultimately posting list tuples are just a matter of understanding the on-disk representation -- a "Small Matter of Programming". Even without deduplication there are potential hazards from the physical deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is not code that runs in VACUUM, despite the name). Make sure that you hold a buffer pin on the leaf page throughout, because you need to do that to make sure that VACUUM cannot concurrently recycle heap TIDs. If VACUUM *is* able to concurrently recycle heap TIDs then it'll be subtly broken. _bt_killitems() is safe because it either holds on to a pin or gives up when the LSN changes at all. (ISTM that your only choice is to hold on to a leaf page pin, since you cannot just decide to give up in the way that _bt_killitems() sometimes can.) Note that the rules surrounding buffer locks/pins for nbtree were tightened up a tiny bit today -- see commit 4a70f829. Also, it's no longer okay to use raw LockBuffer() calls in nbtree, so you're going to have to fix that up when you next rebase -- you must use the new _bt_lockbuf() wrapper function instead, so that the new Valgrind instrumentation is used. This shouldn't be hard. Perhaps you can use Valgrind to verify that this patch doesn't have any unsafe buffer accesses. I recall problems like that in earlier versions of the patch series. Valgrind should be able to detect most bugs like that (though see comments within _bt_lockbuf() for details of a bug in this area that Valgrind cannot detect even now). -- Peter Geoghegan
Re: Generic Index Skip Scan
On Thu, 16 Jul 2020 at 07:52, Floris Van Nee wrote: > > Besides the great efforts that Dmitry et al. are putting into the skip scan > for DISTINCT queries [1], I’m also still keen on extending the use of it > further. I’d like to address the limited cases in which skipping can occur > here. A few months ago I shared an initial rough patch that provided a > generic skip implementation, but lacked the proper planning work [2]. I’d > like to share a second patch set that provides an implementation of the > planner as well. Perhaps this can lead to some proper discussions how we’d > like to shape this patch further. > > Please see [2] for an introduction and some rough performance comparisons. > This patch improves upon those, because it implements proper cost estimation > logic. It will now only choose the skip scan if it’s deemed to be cheaper > than using a regular index scan. Other than that, all the features are still > there. The skip scan can be used in many more types of queries than in the > original DISTINCT patch as provided in [1], making it more performant and > also more predictable for users. > > I’m keen on receiving feedback on this idea and on the patch. I don't think anyone ever thought the feature would be limited to just making DISTINCT go faster. There's certain to be more usages in the future. However, for me it would be premature to look at this now. David
RE: Index Skip Scan (new UniqueKeys)
> > > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. > > I guess you're talking about: > > + if (EC_MUST_BE_REDUNDANT(ec)) > + continue; > > Can you add some test cases to your changes to show the effect of it? It > seem to me redundant keys are already eliminated at this point by either > make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be > I've missed something. > The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like: SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c). The unique keys would end up with (a,b) though. In the previous patch, it'd compared in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan. Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there. > Along the lines I'm also curious about this part: > > - ListCell *k; > - List *exprs = NIL; > - > - foreach(k, ec->ec_members) > - { > - EquivalenceMember *mem = (EquivalenceMember *) > lfirst(k); > - exprs = lappend(exprs, mem->em_expr); > - } > - > - result = lappend(result, makeUniqueKey(exprs, false, false)); > + EquivalenceMember *mem = (EquivalenceMember*) > +lfirst(list_head(ec->ec_members)); > > I'm curious about this myself, maybe someone can clarify. It looks like > generaly speaking there could be more than one member (if not > ec_has_volatile), which "representing knowledge that multiple items are > effectively equal". Is this information is not interesting enough to preserve > it > in unique keys? Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this: SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses. That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions). > > > - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a > is constant, it's eliminated from regular pathkeys. > > What would be the point of skipping if it's a constant? For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan. > > > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than > any of the unique keys. The unique keys are only Expr nodes - they do not > contain the necessary information about ordering. Due to elimination of > some constant path keys, we have to search the attributes of the index to > find the correct prefix to use in skipping. > > IIUC here you mean this function, right? > > + prefix = find_index_prefix_for_pathkey(root,
Re: Index Skip Scan (new UniqueKeys)
> On Sun, Jul 12, 2020 at 12:48:47PM +, Floris Van Nee wrote: > > > > Good point, thanks for looking at this. With the latest planner version > > there > > are indeed more possibilities to use skipping. It never occured to me that > > some of those paths will still rely on index scan returning full data set. > > I'll look > > in details and add verification to prevent putting something like this on > > top of > > skip scan in the next version. > > I believe the required changes are something like in attached patch. There > were a few things I've changed: > - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, > it would create two unique keys, one with a and one with b. However, it > should be one unique key with (a,b). Yes, I've also noticed that while preparing fix for index scan not covered by index and included it. > - the uniquekeys that is built, still contains some redundant keys, that are > normally eliminated from the path keys lists. I guess you're talking about: + if (EC_MUST_BE_REDUNDANT(ec)) + continue; Can you add some test cases to your changes to show the effect of it? It seem to me redundant keys are already eliminated at this point by either make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be I've missed something. Along the lines I'm also curious about this part: - ListCell *k; - List *exprs = NIL; - - foreach(k, ec->ec_members) - { - EquivalenceMember *mem = (EquivalenceMember *) lfirst(k); - exprs = lappend(exprs, mem->em_expr); - } - - result = lappend(result, makeUniqueKey(exprs, false, false)); + EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members)); I'm curious about this myself, maybe someone can clarify. It looks like generaly speaking there could be more than one member (if not ec_has_volatile), which "representing knowledge that multiple items are effectively equal". Is this information is not interesting enough to preserve it in unique keys? > - the distinct_pathkeys may be NULL, even though there's a possibility for > skipping. But it wouldn't create the uniquekeys in this case. This makes the > planner not choose skip scans even though it could. For example in queries > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is > constant, it's eliminated from regular pathkeys. What would be the point of skipping if it's a constant? > - to combat the issues mentioned earlier, there's now a check in > build_index_paths that checks if the query_pathkeys matches the > useful_pathkeys. Note that we have to use the path keys here rather than any > of the unique keys. The unique keys are only Expr nodes - they do not contain > the necessary information about ordering. Due to elimination of some constant > path keys, we have to search the attributes of the index to find the correct > prefix to use in skipping. IIUC here you mean this function, right? + prefix = find_index_prefix_for_pathkey(root, + index, + BackwardScanDirection, + llast_node(PathKey, + root->distinct_pathkeys)); Doesn't it duplicate the job already done in build_index_pathkeys by building those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys. Not sure about unordered indexes, but looks like query_pathkeys should also match in this case. Will also look at the follow up questions in the next email.
Re: Index Skip Scan (new UniqueKeys)
> On Fri, Jul 10, 2020 at 05:03:37PM +, Floris Van Nee wrote: > > Also took another look at the patch now, and found a case of incorrect > data. It looks related to the new way of creating the paths, as I > can't recall seeing this in earlier versions. > > create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, > 10) a, generate_series(1,100) b; > create index on t1 (a,b,c); > > postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; > QUERY PLAN > --- > Sort (cost=2.92..2.94 rows=10 width=20) >Sort Key: a, b DESC, c >-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) > Skip scan: true > (4 rows) Good point, thanks for looking at this. With the latest planner version there are indeed more possibilities to use skipping. It never occured to me that some of those paths will still rely on index scan returning full data set. I'll look in details and add verification to prevent putting something like this on top of skip scan in the next version.
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote: > > On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > * Btree-implementation contains btree specific code to implement amskip, > > introduced in the previous patch. > > The way that you're dealing with B-Tree tuples here needs to account > for posting list tuples: > > > + currItem = >currPos.items[so->currPos.lastItem]; > > + itup = (IndexTuple) (so->currTuples + > > currItem->tupleOffset); > > + nextOffset = ItemPointerGetOffsetNumber(>t_tid); Do you mean this last part with t_tid, which could also have a tid array in case of posting tuple format? > > + /* > > +* To check if we returned the same tuple, try to find a > > +* startItup on the current page. For that we need to update > > +* scankey to match the whole tuple and set nextkey to > > return > > +* an exact tuple, not the next one. If the nextOffset is > > the > > +* same as before, it means we are in the loop, return > > offnum > > +* to the original position and jump further > > +*/ > > Why does it make sense to use the offset number like this? It isn't > stable or reliable. The patch goes on to do this: > > > + startOffset = _bt_binsrch(scan->indexRelation, > > + so->skipScanKey, > > + so->currPos.buf); > > + > > + page = BufferGetPage(so->currPos.buf); > > + maxoff = PageGetMaxOffsetNumber(page); > > + > > + if (nextOffset <= startOffset) > > + { > > Why compare a heap TID's offset number (an offset number for a heap > page) to another offset number for a B-Tree leaf page? They're > fundamentally different things. Well, it's obviously wrong, thanks for noticing. What is necessary is to compare two index tuples, the start and the next one, to test if they're the same (in which case if I'm not mistaken probably we can compare item pointers). I've got this question when I was about to post a new version with changes to address feedback from Andy, now I'll combine them and send a cumulative patch.
RE: Index Skip Scan (new UniqueKeys)
Hi Dmitry, Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions. create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b; create index on t1 (a,b,c); postgres=# explain select distinct on (a) * from t1 order by a,b desc,c; QUERY PLAN --- Sort (cost=2.92..2.94 rows=10 width=20) Sort Key: a, b DESC, c -> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20) Skip scan: true (4 rows) With the 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With index_skipscan enabled, it always gives 1 though. It's not correct that the planner chooses a skip scan followed by sort in this case. -Floris
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. The way that you're dealing with B-Tree tuples here needs to account for posting list tuples: > + currItem = >currPos.items[so->currPos.lastItem]; > + itup = (IndexTuple) (so->currTuples + currItem->tupleOffset); > + nextOffset = ItemPointerGetOffsetNumber(>t_tid); But I wonder more generally what the idea here is. The following comments that immediately follow provide some hints: > + /* > +* To check if we returned the same tuple, try to find a > +* startItup on the current page. For that we need to update > +* scankey to match the whole tuple and set nextkey to return > +* an exact tuple, not the next one. If the nextOffset is the > +* same as before, it means we are in the loop, return offnum > +* to the original position and jump further > +*/ Why does it make sense to use the offset number like this? It isn't stable or reliable. The patch goes on to do this: > + startOffset = _bt_binsrch(scan->indexRelation, > + so->skipScanKey, > + so->currPos.buf); > + > + page = BufferGetPage(so->currPos.buf); > + maxoff = PageGetMaxOffsetNumber(page); > + > + if (nextOffset <= startOffset) > + { Why compare a heap TID's offset number (an offset number for a heap page) to another offset number for a B-Tree leaf page? They're fundamentally different things. -- Peter Geoghegan
Re: Index Skip Scan (new UniqueKeys)
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote: > > I just get the rough idea of patch, looks we have to narrow down the > user cases where we can use this method. Consider the below example: Hi Not exactly narrow down, but rather get rid of wrong usage of skipping for index scan. Since skipping for it was added later than for index only scan I can imagine there are still blind spots, so good that you've looked. In this particular case, when index expressions do not fully cover those expressionse result need to be distinct on, skipping just doesn't have enough information and should not be used. I'll add it to the next version, thanks!
Re: Index Skip Scan (new UniqueKeys)
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > Hi, > > Here is a new version of index skip scan patch, based on v8 patch for > UniqueKeys implementation from [1]. I want to start a new thread to > simplify navigation, hopefully I didn't forget anyone who actively > participated in the discussion. > > To simplify reviewing I've split it into several parts: > > * First two are taken from [1] just for the reference and to make cfbot > happy. > > * Extend-UniqueKeys consists changes that needs to be done for > UniqueKeys to be used in skip scan. Essentially this is a reduced > version of previous implementation from Jesper & David, based on the > new UniqueKeys infrastructure, as it does the very same thing. > > * Index-Skip-Scan contains not am specific code and the overall > infrastructure, including configuration elements and so on. > > * Btree-implementation contains btree specific code to implement amskip, > introduced in the previous patch. > > * The last one contains just documentation bits. > > Interesting enough with a new UniqueKey implementation skipping is > applied in some tests unexpectedly. For those I've disabled > indexskipscan to avoid confusion. > > [1]: > https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com > Thanks for the patch. I just get the rough idea of patch, looks we have to narrow down the user cases where we can use this method. Consider the below example: create table j1(i int, im5 int, im100 int, im1000 int); insert into j1 select i, i%5, i%100, i%1000 from generate_series(1, 1000)i; create index on j1(im5, i); insert into j1 values (1, 1, 0, 0); analyze j1; demo=# select distinct * from j1 where i < 2; i | im5 | im100 | im1000 ---+-+---+ 1 | 1 | 1 | 1 (1 row) demo=# set enable_indexskipscan to off; SET demo=# select distinct * from j1 where i < 2; i | im5 | im100 | im1000 ---+-+---+ 1 | 1 | 0 | 0 1 | 1 | 1 | 1 (2 rows) drop index j1_im5_i_idx; create index on j1(im5, i, im100); demo=# select distinct im5, i, im100 from j1 where i < 2; im5 | i | im100 -+---+--- 1 | 1 | 0 1 | 1 | 1 (2 rows) demo=# set enable_indexskipscan to on; SET demo=# select distinct im5, i, im100 from j1 where i < 2; im5 | i | im100 -+---+--- 1 | 1 | 0 (1 row) -- Best Regards Andy Fan
Re: Index Skip Scan
On Tue, Jun 2, 2020 at 9:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > > > Other than that to summarize current open points for future readers > > > (this thread somehow became quite big): > > > > > > * Making UniqueKeys usage more generic to allow using skip scan for > more > > > use cases (hopefully it was covered by the v33, but I still need a > > > confirmation from David, like blinking twice or something). > > > > > > * Suspicious performance difference between different type of workload, > > > mentioned by Tomas (unfortunately I had no chance yet to > investigate). > > > > > > * Thinking about supporting conditions, that are not covered by the > index, > > > to make skipping more flexible (one of the potential next steps in > the > > > future, as suggested by Floris). > > > > > > > Looks this is the latest patch, which commit it is based on? Thanks > > I have a rebased version, if you're about it. Didn't posted it yet > mostly since I'm in the middle of adapting it to the UniqueKeys from > other thread. Would it be ok for you to wait a bit until I'll post > finished version? > Sure, that's OK. The discussion on UniqueKey thread looks more complex than what I expected, that's why I want to check the code here, but that's fine, you can work on your schedule. -- Best Regards Andy Fan
Re: Index Skip Scan
> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote: > > > Other than that to summarize current open points for future readers > > (this thread somehow became quite big): > > > > * Making UniqueKeys usage more generic to allow using skip scan for more > > use cases (hopefully it was covered by the v33, but I still need a > > confirmation from David, like blinking twice or something). > > > > * Suspicious performance difference between different type of workload, > > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > > > * Thinking about supporting conditions, that are not covered by the index, > > to make skipping more flexible (one of the potential next steps in the > > future, as suggested by Floris). > > > > Looks this is the latest patch, which commit it is based on? Thanks I have a rebased version, if you're about it. Didn't posted it yet mostly since I'm in the middle of adapting it to the UniqueKeys from other thread. Would it be ok for you to wait a bit until I'll post finished version?
Re: Index Skip Scan
On Wed, Apr 8, 2020 at 3:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote: > > > > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, > Floris, I > > > would appreciate if in the future you can make it more visible that > changes you > > > suggest contain some fixes. E.g. it wasn't clear for me from your > previous email > > > that that's the case, and it doesn't make sense to pull into different > direction > > > when we're trying to achieve the same goal :) > > > > I wasn't aware that this particular case could be triggered before I saw > Dilip's email, otherwise I'd have mentioned it here of course. It's just > that because my patch handles filter conditions in general, it works for > this case too. > > Oh, then fortunately I've got a wrong impression, sorry and thanks for > clarification :) > > > > > > In the patch I posted a week ago these cases are all handled > > > > > correctly, as it introduces this extra logic in the Executor. > > > > > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > > > > > I'll definitely take a look at suggested changes in filtering part. > > > > It may be possible to just merge the filtering part into your patch, but > I'm not entirely sure. Basically you have to pull the information about > skipping one level up, out of the node, into the generic IndexNext code. > > I was actually thinking more about just preventing skip scan in this > situation, which is if I'm not mistaken could be solved by inspecting > qual conditions to figure out if they're covered in the index - > something like in attachments (this implementation is actually too > restrictive in this sense and will not allow e.g. expressions, that's > why I haven't bumped patch set version for it - soon I'll post an > extended version). > > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). > > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > > * Thinking about supporting conditions, that are not covered by the index, > to make skipping more flexible (one of the potential next steps in the > future, as suggested by Floris). > Looks this is the latest patch, which commit it is based on? Thanks -- Best Regards Andy Fan
Re: Index Skip Scan
On Fri, 15 May 2020 at 6:06 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote: > > > > + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, > dir)) > > + { > > > > Here we expect whether the "next" unique key can be found on this page > > or not, but we are using the function which suggested whether the > > "current" key can be found on this page or not. I think in boundary > > cases where the high key is equal to the current key, this function > > will return true (which is expected from this function), and based on > > that we will simply scan the current page and IMHO that cost could be > > avoided no? > > Yes, looks like you're right, there is indeed an unecessary extra scan > happening. To avoid that we can see the key->nextkey and adjust higher > boundary correspondingly. Will also add this into the next rebased > patch, thanks! Great thanks! > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote: > > + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir)) > + { > > Here we expect whether the "next" unique key can be found on this page > or not, but we are using the function which suggested whether the > "current" key can be found on this page or not. I think in boundary > cases where the high key is equal to the current key, this function > will return true (which is expected from this function), and based on > that we will simply scan the current page and IMHO that cost could be > avoided no? Yes, looks like you're right, there is indeed an unecessary extra scan happening. To avoid that we can see the key->nextkey and adjust higher boundary correspondingly. Will also add this into the next rebased patch, thanks!
Re: Index Skip Scan
On Mon, Jan 20, 2020 at 5:05 PM Peter Geoghegan wrote: > You can add another assertion that calls a new utility function in > bufmgr.c. That can use the same logic as this existing assertion in > FlushOneBuffer(): > > Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr))); > > We haven't needed assertions like this so far because it's usually it > is clear whether or not a buffer lock is held (plus the bufmgr.c > assertions help on their own). Just in case anybody missed it, I am working on a patch that makes nbtree use Valgrind instrumentation to detect page accessed without a buffer content lock held: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpwrub_45eexlh1+k...@mail.gmail.com There is also one component that detects when any buffer is accessed without a buffer pin. -- Peter Geoghegan
Re: Index Skip Scan
On Mon, May 11, 2020 at 4:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote: > > > > > > +static inline bool > > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > > > + Buffer buf, ScanDirection dir) > > > > +{ > > > > + OffsetNumber low, high; > > > > + Page page = BufferGetPage(buf); > > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > > > + > > > > + low = P_FIRSTDATAKEY(opaque); > > > > + high = PageGetMaxOffsetNumber(page); > > > > + > > > > + if (unlikely(high < low)) > > > > + return false; > > > > + > > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > > > +} > > > > > > > > I think the high key condition should be changed to > > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > > > prefix qual is equal to the high key then also there is no point in > > > > searching on the current page so we can directly skip. > > > > > > From nbtree/README and comments to functions like _bt_split I've got an > > > impression that the high key could be equal to the last item on the leaf > > > page, so there is a point in searching. Is that incorrect? > > > > But IIUC, here we want to decide whether we will get the next key in > > the current page or not? > > In general this function does what it says, it checks wether or not the > provided scankey could be found within the page. All the logic about > finding a proper next key to fetch is implemented on the call site, and > within this function we want to test whatever was passed in. Does it > answer the question? Ok, I agree that the function is doing what it is expected to do. But, then I have a problem with this call site. + /* Check if the next unique key can be found within the current page. + * Since we do not lock the current page between jumps, it's possible + * that it was splitted since the last time we saw it. This is fine in + * case of scanning forward, since page split to the right and we are + * still on the left most page. In case of scanning backwards it's + * possible to loose some pages and we need to remember the previous + * page, and then follow the right link from the current page until we + * find the original one. + * + * Since the whole idea of checking the current page is to protect + * ourselves and make more performant statistic mismatch case when + * there are too many distinct values for jumping, it's not clear if + * the complexity of this solution in case of backward scan is + * justified, so for now just avoid it. + */ + if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir)) + { + LockBuffer(so->currPos.buf, BT_READ); + + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir)) + { Here we expect whether the "next" unique key can be found on this page or not, but we are using the function which suggested whether the "current" key can be found on this page or not. I think in boundary cases where the high key is equal to the current key, this function will return true (which is expected from this function), and based on that we will simply scan the current page and IMHO that cost could be avoided no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote: > > > > +static inline bool > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > > + Buffer buf, ScanDirection dir) > > > +{ > > > + OffsetNumber low, high; > > > + Page page = BufferGetPage(buf); > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > > + > > > + low = P_FIRSTDATAKEY(opaque); > > > + high = PageGetMaxOffsetNumber(page); > > > + > > > + if (unlikely(high < low)) > > > + return false; > > > + > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > > +} > > > > > > I think the high key condition should be changed to > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > > prefix qual is equal to the high key then also there is no point in > > > searching on the current page so we can directly skip. > > > > From nbtree/README and comments to functions like _bt_split I've got an > > impression that the high key could be equal to the last item on the leaf > > page, so there is a point in searching. Is that incorrect? > > But IIUC, here we want to decide whether we will get the next key in > the current page or not? In general this function does what it says, it checks wether or not the provided scankey could be found within the page. All the logic about finding a proper next key to fetch is implemented on the call site, and within this function we want to test whatever was passed in. Does it answer the question?
Re: Index Skip Scan
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote: > > > > Some more comments... > > Thanks for reviewing. Since this patch took much longer than I expected, > it's useful to have someone to look at it with a "fresh eyes". > > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + _bt_update_skip_scankeys(scan, indexRel); > > + > > ... > > + /* > > + * We haven't found scan key within the current page, so let's scan from > > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset > > + * number > > + */ > > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > > + stack = _bt_search(scan->indexRelation, so->skipScanKey, > > +, BT_READ, scan->xs_snapshot); > > > > Why do we need to set so->skipScanKey->nextkey = > > ScanDirectionIsForward(dir); multiple times? I think it is fine to > > just set it once? > > I believe it was necessary for previous implementations, but in the > current version we can avoid this, you're right. > > > +static inline bool > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > > + Buffer buf, ScanDirection dir) > > +{ > > + OffsetNumber low, high; > > + Page page = BufferGetPage(buf); > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > > + > > + low = P_FIRSTDATAKEY(opaque); > > + high = PageGetMaxOffsetNumber(page); > > + > > + if (unlikely(high < low)) > > + return false; > > + > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > > + _bt_compare(scan->indexRelation, key, page, high) < 1); > > +} > > > > I think the high key condition should be changed to > > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > > prefix qual is equal to the high key then also there is no point in > > searching on the current page so we can directly skip. > > From nbtree/README and comments to functions like _bt_split I've got an > impression that the high key could be equal to the last item on the leaf > page, so there is a point in searching. Is that incorrect? But IIUC, here we want to decide whether we will get the next key in the current page or not? Is my understanding is correct? So if our key (the last tuple key) is equal to the high key means the max key on this page is the same as what we already got in the last tuple so why would we want to go on this page? because this will not give us the new key. So ideally, we should only be looking into this page if our last tuple key is smaller than the high key. Am I missing something? > > > + /* Check if an index skip scan is possible. */ > > + can_skip = enable_indexskipscan & index->amcanskip; > > + > > + /* > > + * Skip scan is not supported when there are qual conditions, which are not > > + * covered by index. The reason for that is that those conditions are > > + * evaluated later, already after skipping was applied. > > + * > > + * TODO: This implementation is too restrictive, and doesn't allow e.g. > > + * index expressions. For that we need to examine index_clauses too. > > + */ > > + if (root->parse->jointree != NULL) > > + { > > + ListCell *lc; > > + > > + foreach(lc, (List *)root->parse->jointree->quals) > > + { > > + Node *expr, *qual = (Node *) lfirst(lc); > > + Var *var; > > > > I think we can avoid checking for expression if can_skip is already false. > > Yes, that makes sense. I'll include your suggestions into the next > rebased version I'm preparing. Ok. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote: > > Some more comments... Thanks for reviewing. Since this patch took much longer than I expected, it's useful to have someone to look at it with a "fresh eyes". > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > + _bt_update_skip_scankeys(scan, indexRel); > + > ... > + /* > + * We haven't found scan key within the current page, so let's scan from > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset > + * number > + */ > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir); > + stack = _bt_search(scan->indexRelation, so->skipScanKey, > +, BT_READ, scan->xs_snapshot); > > Why do we need to set so->skipScanKey->nextkey = > ScanDirectionIsForward(dir); multiple times? I think it is fine to > just set it once? I believe it was necessary for previous implementations, but in the current version we can avoid this, you're right. > +static inline bool > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, > + Buffer buf, ScanDirection dir) > +{ > + OffsetNumber low, high; > + Page page = BufferGetPage(buf); > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page); > + > + low = P_FIRSTDATAKEY(opaque); > + high = PageGetMaxOffsetNumber(page); > + > + if (unlikely(high < low)) > + return false; > + > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 && > + _bt_compare(scan->indexRelation, key, page, high) < 1); > +} > > I think the high key condition should be changed to > _bt_compare(scan->indexRelation, key, page, high) < 0 ? Because if > prefix qual is equal to the high key then also there is no point in > searching on the current page so we can directly skip. >From nbtree/README and comments to functions like _bt_split I've got an impression that the high key could be equal to the last item on the leaf page, so there is a point in searching. Is that incorrect? > + /* Check if an index skip scan is possible. */ > + can_skip = enable_indexskipscan & index->amcanskip; > + > + /* > + * Skip scan is not supported when there are qual conditions, which are not > + * covered by index. The reason for that is that those conditions are > + * evaluated later, already after skipping was applied. > + * > + * TODO: This implementation is too restrictive, and doesn't allow e.g. > + * index expressions. For that we need to examine index_clauses too. > + */ > + if (root->parse->jointree != NULL) > + { > + ListCell *lc; > + > + foreach(lc, (List *)root->parse->jointree->quals) > + { > + Node *expr, *qual = (Node *) lfirst(lc); > + Var *var; > > I think we can avoid checking for expression if can_skip is already false. Yes, that makes sense. I'll include your suggestions into the next rebased version I'm preparing.
Re: Index Skip Scan
Sorry for late reply. > On Tue, Apr 14, 2020 at 09:19:22PM +1200, David Rowley wrote: > > I've not yet looked at the latest patch, but I did put some thoughts > into an email on the other thread that's been discussing UniqueKeys > [1]. > > I'm keen to hear thoughts on the plan I mentioned over there. Likely > it would be best to discuss the specifics of what additional features > we need to add to UniqueKeys for skip scans over here, but discuss any > chances which affect both patches over there. We certainly can't have > two separate implementations of UniqueKeys, so I believe the skip > scans UniqueKeys patch should most likely be based on the one in [1] > or some descendant of it. > > [1] > https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com Yes, I've come to the same conclusion, although I have my concerns about having such a dependency between patches. Will look at the suggested patches soon.
Re: Index Skip Scan
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Other than that to summarize current open points for future readers > (this thread somehow became quite big): > > * Making UniqueKeys usage more generic to allow using skip scan for more > use cases (hopefully it was covered by the v33, but I still need a > confirmation from David, like blinking twice or something). I've not yet looked at the latest patch, but I did put some thoughts into an email on the other thread that's been discussing UniqueKeys [1]. I'm keen to hear thoughts on the plan I mentioned over there. Likely it would be best to discuss the specifics of what additional features we need to add to UniqueKeys for skip scans over here, but discuss any chances which affect both patches over there. We certainly can't have two separate implementations of UniqueKeys, so I believe the skip scans UniqueKeys patch should most likely be based on the one in [1] or some descendant of it. [1] https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com
RE: Index Skip Scan
> > * Suspicious performance difference between different type of workload, > mentioned by Tomas (unfortunately I had no chance yet to investigate). > His benchmark results indeed most likely point to multiple comparisons being done. Since the most likely place where these occur is _bt_readpage, I suspect this is called multiple times. Looking at your patch, I think that's indeed the case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a complete misestimation and chooses a skip scan here. First call to _bt_readpage will compare every tuple on the page already and store everything in the workspace, which will now contain [1,2,3,4,5]. However, when a skip is done the elements on the page (not the workspace) are compared to find the next one. Then, another _bt_readpage is done, starting at the new offnum. So we'll compare every tuple (except 1) on the page again. Workspace now contains [2,3,4,5]. Next tuple we'll end up with [3,4,5] etc. So tuple 5 actually gets compared 5 times in _bt_readpage alone. -Floris
Re: Index Skip Scan
ove), describing the sort * ordering of the path's output rows. + * + * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above), + * describing the XXX. */ typedef struct Path { @@ -1129,6 +1146,8 @@ typedef struct Path List *pathkeys; /* sort ordering of path's output */ /* pathkeys is a List of PathKey nodes; see above */ + + List *uniquekeys; /* the unique keys, or NIL if none */ } Path; /* Macro for extracting a path's parameterization relids; beware double eval */ diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h index 6126b491bf..006248bfb5 100644 --- a/src/include/nodes/print.h +++ b/src/include/nodes/print.h @@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump); extern void print_rt(const List *rtable); extern void print_expr(const Node *expr, const List *rtable); extern void print_pathkeys(const List *pathkeys, const List *rtable); +extern void print_uniquekeys(const List *uniquekeys, const List *rtable); extern void print_tl(const List *tlist, const List *rtable); extern void print_slot(TupleTableSlot *slot); diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h index e450fe112a..fd25997af5 100644 --- a/src/include/optimizer/pathnode.h +++ b/src/include/optimizer/pathnode.h @@ -34,6 +34,7 @@ extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path); extern bool add_partial_path_precheck(RelOptInfo *parent_rel, Cost total_cost, List *pathkeys); +extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path); extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer, int parallel_workers); extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, @@ -44,6 +45,7 @@ extern IndexPath *create_index_path(PlannerInfo *root, List *indexorderbys, List *indexorderbycols, List *pathkeys, + List *uniquekeys, ScanDirection indexscandir, bool indexonly, Relids required_outer, diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h index 9ab73bd20c..5b6be383b3 100644 --- a/src/include/optimizer/paths.h +++ b/src/include/optimizer/paths.h @@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root, extern List *make_pathkeys_for_sortclauses(PlannerInfo *root, List *sortclauses, List *tlist); +extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root, + List *sortclauses, + List *tlist); extern void initialize_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo); extern void update_mergeclause_eclasses(PlannerInfo *root, @@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root, extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, List *live_childrels); +/* + * uniquekey.c + * Utilities for matching and building unique keys + */ +extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses); +extern bool uniquekeys_contained_in(List *keys1, List *keys2); +extern bool has_useful_uniquekeys(PlannerInfo *root); + #endif /* PATHS_H */ -- 2.21.0 >From a44f8bf868133f0a342bf8de477d650a455efac7 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 15 Nov 2019 09:46:53 -0500 Subject: [PATCH v33 2/2] Index skip scan Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1]) on top of IndexOnlyScan and IndexScan. To make it suitable for both situations when there are small number of distinct values and significant amount of distinct values the following approach is taken - instead of searching from the root for every value we're searching for then first on the current page, and then if not found continue searching from the root. Original patch and design were proposed by Thomas Munro [2], revived and improved by Dmitry Dolgov and Jesper Pedersen. [1] https://wiki.postgresql.org/wiki/Loose_indexscan [2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com Author: Jesper Pedersen, Dmitry Dolgov Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan --- contrib/bloom/blutils.c | 1 + doc/src/sgml/config.sgml | 15 + doc/src/sgml/indexam.sgml | 63 ++ doc/src/sgml/indices.sgml | 23 + src/backend/access/brin/brin.c| 1 + src/backend/access/gin/ginutil.c | 1 + src/backend/access/gist/gist.c| 1 + src/backend/access/hash/hash.c| 1 + src/backend/access/index/indexam.c| 18 + src/backend/access/nbtree/nbtree.c| 13 + src/backend/access/nbtree/nbtsearch.c | 469 - src/backend/access/spgist/spgutils.c | 1 + src/backend/commands/explain.c
RE: Index Skip Scan
> > Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I > would appreciate if in the future you can make it more visible that changes > you > suggest contain some fixes. E.g. it wasn't clear for me from your previous > email > that that's the case, and it doesn't make sense to pull into different > direction > when we're trying to achieve the same goal :) I wasn't aware that this particular case could be triggered before I saw Dilip's email, otherwise I'd have mentioned it here of course. It's just that because my patch handles filter conditions in general, it works for this case too. > > > > In the patch I posted a week ago these cases are all handled > > > correctly, as it introduces this extra logic in the Executor. > > > > Okay, So I think we can merge those fixes in Dmitry's patch set. > > I'll definitely take a look at suggested changes in filtering part. It may be possible to just merge the filtering part into your patch, but I'm not entirely sure. Basically you have to pull the information about skipping one level up, out of the node, into the generic IndexNext code. I'm eager to get some form of skip scans into master - any kind of patch that makes this possible is fine by me. Long term I think my version provides a more generic approach, with which we can optimize a much broader range of queries. However, since many more eyes have seen your patch so far, I hope yours can be committed much sooner. My knowledge on this committer process is limited though. That's why I've just posted mine so far in the hope of collecting some feedback, also on how we should continue with the process. -Floris
Re: Index Skip Scan
> On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee > wrote: > > > > There's a number of ways we can force a 'filter' rather than an > > 'index condition'. Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I would appreciate if in the future you can make it more visible that changes you suggest contain some fixes. E.g. it wasn't clear for me from your previous email that that's the case, and it doesn't make sense to pull into different direction when we're trying to achieve the same goal :) > > In the patch I posted a week ago these cases are all handled > > correctly, as it introduces this extra logic in the Executor. > > Okay, So I think we can merge those fixes in Dmitry's patch set. I'll definitely take a look at suggested changes in filtering part.
Re: Index Skip Scan
On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee wrote: > > > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > > if we have some filter? I mean every time we select the unique row > > > > based on the prefix key and that might get rejected by an external > > > > filter right? > > > > > Yeah, you're correct. This patch only handles the index conditions and > doesn't handle any filters correctly. There's a check in the planner for the > IndexScan for example that only columns that exist in the index are used. > However, this check is not sufficient as your example shows. There's a number > of ways we can force a 'filter' rather than an 'index condition' and still > choose a skip scan (WHERE b!=0 is another one I think). This leads to > incorrect query results. Right > This patch would need some logic in the planner to never choose the skip scan > in these cases. Better long-term solution is to adapt the rest of the > executor to work correctly in the cases of external filters (this ties in > with the previous visibility discussion as well, as that's basically also an > external filter, although a special case). I agree > In the patch I posted a week ago these cases are all handled correctly, as it > introduces this extra logic in the Executor. Okay, So I think we can merge those fixes in Dmitry's patch set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: Index Skip Scan
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > > > I was just wondering how the distinct will work with the "skip scan" > > > if we have some filter? I mean every time we select the unique row > > > based on the prefix key and that might get rejected by an external > > > filter right? > > Yeah, you're correct. This patch only handles the index conditions and doesn't handle any filters correctly. There's a check in the planner for the IndexScan for example that only columns that exist in the index are used. However, this check is not sufficient as your example shows. There's a number of ways we can force a 'filter' rather than an 'index condition' and still choose a skip scan (WHERE b!=0 is another one I think). This leads to incorrect query results. This patch would need some logic in the planner to never choose the skip scan in these cases. Better long-term solution is to adapt the rest of the executor to work correctly in the cases of external filters (this ties in with the previous visibility discussion as well, as that's basically also an external filter, although a special case). In the patch I posted a week ago these cases are all handled correctly, as it introduces this extra logic in the Executor. -Floris
Re: Index Skip Scan
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > > > I was just wondering how the distinct will work with the "skip scan" > > if we have some filter? I mean every time we select the unique row > > based on the prefix key and that might get rejected by an external > > filter right? > > Not exactly. In the case of index-only scan, we skipping to the first > unique position, and then use already existing functionality > (_bt_readpage with stepping to the next pages) to filter out those > records that do not pass the condition. I agree but that will work if we have a valid index clause, but "b%100=0" condition will not create an index clause, right? However, if we change the query to select distinct(a) from t where b=100 then it works fine because this condition will create an index clause. There are even couple of tests > in the patch for this. In case of index scan, when there are some > conditions, current implementation do not consider skipping. > > > So I tried an example to check this. > > Can you tell on which version of the patch you were testing? I have tested on v33. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote: > > I was just wondering how the distinct will work with the "skip scan" > if we have some filter? I mean every time we select the unique row > based on the prefix key and that might get rejected by an external > filter right? Not exactly. In the case of index-only scan, we skipping to the first unique position, and then use already existing functionality (_bt_readpage with stepping to the next pages) to filter out those records that do not pass the condition. There are even couple of tests in the patch for this. In case of index scan, when there are some conditions, current implementation do not consider skipping. > So I tried an example to check this. Can you tell on which version of the patch you were testing?
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote: > > > > Seems like you forgot to add the uniquekey.c file in the > > v33-0001-Unique-key.patch. > > Oh, you're right, thanks. Here is the corrected patch. I was just wondering how the distinct will work with the "skip scan" if we have some filter? I mean every time we select the unique row based on the prefix key and that might get rejected by an external filter right? So I tried an example to check this. postgres[50006]=# \d t Table "public.t" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | b | integer | | | Indexes: "idx" btree (a, b) postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i; INSERT 0 200 postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i; INSERT 0 200 postgres[50006]=# set enable_indexskipscan =off; SET postgres[50006]=# select distinct(a) from t where b%100=0; a --- 1 2 (2 rows) postgres[50006]=# set enable_indexskipscan =on; SET postgres[50006]=# select distinct(a) from t where b%100=0; a --- (0 rows) postgres[50006]=# explain select distinct(a) from t where b%100=0; QUERY PLAN --- Index Only Scan using idx on t (cost=0.15..1.55 rows=10 width=4) Skip scan: true Filter: ((b % 100) = 0) (3 rows) I think in such cases we should not select the skip scan. This should behave like we have a filter on the non-index field. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
*root, extern List *make_pathkeys_for_sortclauses(PlannerInfo *root, List *sortclauses, List *tlist); +extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root, + List *sortclauses, + List *tlist); extern void initialize_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo); extern void update_mergeclause_eclasses(PlannerInfo *root, @@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root, extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, List *live_childrels); +/* + * uniquekey.c + * Utilities for matching and building unique keys + */ +extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses); +extern bool uniquekeys_contained_in(List *keys1, List *keys2); +extern bool has_useful_uniquekeys(PlannerInfo *root); + #endif /* PATHS_H */ -- 2.21.0 >From 666a8095b700365f78de80fa0febc4a0ac24ae7a Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 15 Nov 2019 09:46:53 -0500 Subject: [PATCH v33 2/2] Index skip scan Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1]) on top of IndexOnlyScan and IndexScan. To make it suitable for both situations when there are small number of distinct values and significant amount of distinct values the following approach is taken - instead of searching from the root for every value we're searching for then first on the current page, and then if not found continue searching from the root. Original patch and design were proposed by Thomas Munro [2], revived and improved by Dmitry Dolgov and Jesper Pedersen. [1] https://wiki.postgresql.org/wiki/Loose_indexscan [2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com Author: Jesper Pedersen, Dmitry Dolgov Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan --- contrib/bloom/blutils.c | 1 + doc/src/sgml/config.sgml | 15 + doc/src/sgml/indexam.sgml | 63 ++ doc/src/sgml/indices.sgml | 23 + src/backend/access/brin/brin.c| 1 + src/backend/access/gin/ginutil.c | 1 + src/backend/access/gist/gist.c| 1 + src/backend/access/hash/hash.c| 1 + src/backend/access/index/indexam.c| 18 + src/backend/access/nbtree/nbtree.c| 13 + src/backend/access/nbtree/nbtsearch.c | 469 - src/backend/access/spgist/spgutils.c | 1 + src/backend/commands/explain.c| 25 + src/backend/executor/nodeIndexonlyscan.c | 97 ++- src/backend/executor/nodeIndexscan.c | 56 +- src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/outfuncs.c | 2 + src/backend/nodes/readfuncs.c | 2 + src/backend/optimizer/path/costsize.c | 1 + src/backend/optimizer/path/indxpath.c | 37 ++ src/backend/optimizer/plan/createplan.c | 20 +- src/backend/optimizer/plan/planner.c | 10 +- src/backend/optimizer/util/pathnode.c | 68 ++ src/backend/optimizer/util/plancat.c | 1 + src/backend/utils/misc/guc.c | 9 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/amapi.h| 8 + src/include/access/genam.h| 2 + src/include/access/nbtree.h | 7 + src/include/access/sdir.h | 7 + src/include/nodes/execnodes.h | 6 + src/include/nodes/pathnodes.h | 5 + src/include/nodes/plannodes.h | 4 + src/include/optimizer/cost.h | 1 + src/include/optimizer/pathnode.h | 3 + src/test/regress/expected/select_distinct.out | 621 ++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/select_distinct.sql | 254 +++ 38 files changed, 1845 insertions(+), 14 deletions(-) diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index 0104d02f67..a018b7f3d0 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -133,6 +133,7 @@ blhandler(PG_FUNCTION_ARGS) amroutine->ambulkdelete = blbulkdelete; amroutine->amvacuumcleanup = blvacuumcleanup; amroutine->amcanreturn = NULL; + amroutine->amskip = NULL; amroutine->amcostestimate = blcostestimate; amroutine->amoptions = bloptions; amroutine->amproperty = NULL; diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e07dc01e80..36ba75b077 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4517,6 +4517,21 @@ ANY num_sync ( + enable_indexskipscan (boolean) + + enable_indexskipscan configuration parameter + + + + +Enables
Re: Index Skip Scan
On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > > > Yes, I was complaining that a ProjectionPath breaks the optimisation > > and I don't believe there's any reason that it should. > > > > I believe the way to make that work correctly requires paying > > attention to the Path's uniquekeys rather than what type of path it > > is. > > Thanks for the suggestion. As a result of the discussion I've modified > the patch, does it look similar to what you had in mind? > > In this version if all conditions are met and there are corresponding > unique keys, a new index skip scan path will be added to > unique_pathlists. In case if requested distinct clauses match with > unique keys, create_distinct_paths can choose this path without needen > to know what kind of path is it. Also unique_keys are passed through > ProjectionPath, so optimization for the example mentioned in this thread > before now should work (I've added one test for that). > > I haven't changed anything about UniqueKey structure itself (one of the > suggestions was about Expr instead of EquivalenceClass), but I believe > we need anyway to figure out how two existing imlementation (in this > patch and from [1]) of this idea can be connected. > > [1]: > https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com --- src/backend/nodes/outfuncs.c | 14 ++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/allpaths.c | 8 +++ src/backend/optimizer/path/indxpath.c | 41 src/backend/optimizer/path/pathkeys.c | 71 ++- src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 37 +- src/backend/optimizer/util/pathnode.c | 46 + src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 19 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 2 + src/include/optimizer/paths.h | 11 + 15 files changed, 272 insertions(+), 23 deletions(-) Seems like you forgot to add the uniquekey.c file in the v33-0001-Unique-key.patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Index Skip Scan
On Wed, Mar 25, 2020 at 12:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > > > There was a dedicated thread [1] where David explain his idea very > > detailed, and you can also check that messages around that message for > > the context. hope it helps. > > Thanks for pointing out to this thread! Somehow I've missed it, and now > looks like we need to make some efforts to align patches for index skip > scan and distincClause elimination. > Yes:). Looks Index skip scan is a way of make a distinct result without a real distinct node, which happens after the UniqueKeys check where I try to see if the result is unique already and before the place where create a unique node for distinct node(With index skip scan we don't need it all). Currently in my patch, the logical here is 1). Check the UniqueKey to see if the result is not unique already. if not, go to next 2). After the distinct paths are created, I will add the result of distinct path as a unique key. Will you add the index skip scan path during create_distincts_paths and add the UniqueKey to RelOptInfo? if so I guess my current patch can handle it since it cares about the result of distinct path but no worried about how we archive that. Best Regards Andy Fan
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote: > > There was a dedicated thread [1] where David explain his idea very > detailed, and you can also check that messages around that message for > the context. hope it helps. Thanks for pointing out to this thread! Somehow I've missed it, and now looks like we need to make some efforts to align patches for index skip scan and distincClause elimination.
Re: Index Skip Scan
> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote: > > Yes, I was complaining that a ProjectionPath breaks the optimisation > and I don't believe there's any reason that it should. > > I believe the way to make that work correctly requires paying > attention to the Path's uniquekeys rather than what type of path it > is. Thanks for the suggestion. As a result of the discussion I've modified the patch, does it look similar to what you had in mind? In this version if all conditions are met and there are corresponding unique keys, a new index skip scan path will be added to unique_pathlists. In case if requested distinct clauses match with unique keys, create_distinct_paths can choose this path without needen to know what kind of path is it. Also unique_keys are passed through ProjectionPath, so optimization for the example mentioned in this thread before now should work (I've added one test for that). I haven't changed anything about UniqueKey structure itself (one of the suggestions was about Expr instead of EquivalenceClass), but I believe we need anyway to figure out how two existing imlementation (in this patch and from [1]) of this idea can be connected. [1]: https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com >From c7af8157da82db1cedf02e6ec0de355b56275680 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Tue, 24 Mar 2020 17:04:32 +0100 Subject: [PATCH v33 1/2] Unique key Design by David Rowley. Author: Jesper Pedersen --- src/backend/nodes/outfuncs.c | 14 ++ src/backend/nodes/print.c | 39 +++ src/backend/optimizer/path/Makefile | 3 +- src/backend/optimizer/path/allpaths.c | 8 +++ src/backend/optimizer/path/indxpath.c | 41 src/backend/optimizer/path/pathkeys.c | 71 ++- src/backend/optimizer/plan/planagg.c | 1 + src/backend/optimizer/plan/planmain.c | 1 + src/backend/optimizer/plan/planner.c | 37 +- src/backend/optimizer/util/pathnode.c | 46 + src/include/nodes/nodes.h | 1 + src/include/nodes/pathnodes.h | 19 +++ src/include/nodes/print.h | 1 + src/include/optimizer/pathnode.h | 2 + src/include/optimizer/paths.h | 11 + 15 files changed, 272 insertions(+), 23 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..16083e7a7e 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node) WRITE_FLOAT_FIELD(startup_cost, "%.2f"); WRITE_FLOAT_FIELD(total_cost, "%.2f"); WRITE_NODE_FIELD(pathkeys); + WRITE_NODE_FIELD(uniquekeys); } /* @@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(eq_classes); WRITE_BOOL_FIELD(ec_merging_done); WRITE_NODE_FIELD(canon_pathkeys); + WRITE_NODE_FIELD(canon_uniquekeys); WRITE_NODE_FIELD(left_join_clauses); WRITE_NODE_FIELD(right_join_clauses); WRITE_NODE_FIELD(full_join_clauses); @@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_NODE_FIELD(placeholder_list); WRITE_NODE_FIELD(fkey_list); WRITE_NODE_FIELD(query_pathkeys); + WRITE_NODE_FIELD(query_uniquekeys); WRITE_NODE_FIELD(group_pathkeys); WRITE_NODE_FIELD(window_pathkeys); WRITE_NODE_FIELD(distinct_pathkeys); @@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node) WRITE_BOOL_FIELD(pk_nulls_first); } +static void +_outUniqueKey(StringInfo str, const UniqueKey *node) +{ + WRITE_NODE_TYPE("UNIQUEKEY"); + + WRITE_NODE_FIELD(eq_clause); +} + static void _outPathTarget(StringInfo str, const PathTarget *node) { @@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj) case T_PathKey: _outPathKey(str, obj); break; + case T_UniqueKey: +_outUniqueKey(str, obj); +break; case T_PathTarget: _outPathTarget(str, obj); break; diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c index 42476724d8..d286b34544 100644 --- a/src/backend/nodes/print.c +++ b/src/backend/nodes/print.c @@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable) printf(")\n"); } +/* + * print_uniquekeys - + * uniquekeys list of UniqueKeys + */ +void +print_uniquekeys(const List *uniquekeys, const List *rtable) +{ + ListCell *l; + + printf("("); + foreach(l, uniquekeys) + { + UniqueKey *unique_key = (UniqueKey *) lfirst(l); + EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause; + ListCell *k; + bool first = true; + + /* chase up */ + while (eclass->ec_merged) + eclass = eclass->ec_merged; + + printf("("); + foreach(k, eclass->ec_members) + { + EquivalenceMember *mem = (EquivalenceMember *) lfirst(
Re: Index Skip Scan
> > > On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee > wrote: > > I'm unsure which version number to give this patch (to continue with > numbers from previous skip scan patches, or to start numbering from scratch > again). It's a rather big change, so one could argue it's mostly a separate > patch. I guess it mostly depends on how close the original versions were to > be committable. Thoughts? > > I don't know, but from the sidelines, it'd be nice to see the unique > path part go into PG13, where IIUC it can power the "useless unique > removal" patch. > Actually I have a patch to remove the distinct clause some long time ago[1], and later it came to the UniqueKey as well, you can see [2] for the current status. [1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107 [2] https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uafudmf4wgovkel3onbatqju9nsxtuucpp...@mail.gmail.com
Re: Index Skip Scan
Hi Floris, On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee wrote: > create index on t1 (a,b,c); > select * from t1 where b in (100, 200); > Execution Time: 2.464 ms > Execution Time: 252.224 ms > Execution Time: 244.872 ms Wow. This is very cool work and I'm sure it will become a major headline feature of PG14 if the requisite planner brains can be sorted out. On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee wrote: > I'm unsure which version number to give this patch (to continue with numbers > from previous skip scan patches, or to start numbering from scratch again). > It's a rather big change, so one could argue it's mostly a separate patch. I > guess it mostly depends on how close the original versions were to be > committable. Thoughts? I don't know, but from the sidelines, it'd be nice to see the unique path part go into PG13, where IIUC it can power the "useless unique removal" patch.
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 16:44, Andy Fan wrote: >> >> >> I think the UniqueKeys may need to be changed from using >> EquivalenceClasses to use Exprs instead. > > > When I try to understand why UniqueKeys needs EquivalenceClasses, > see your comments here. I feel that FuncExpr can't be > used to as a UniquePath even we can create unique index on f(a) > and f->strict == true. The reason is even we know a is not null, > f->strict = true. it is still be possible that f(a) == null. unique index > allows more than 1 null values. so shall we move further to use varattrno > instead of Expr? if so, we can also use a list of Bitmapset to present multi > unique path of a single RelOptInfo. We do need some method to determine if NULL values are possible. At the base relation level that can probably be done by checking NOT NULL constraints and strict base quals. At higher levels, we can use strict join quals as proofs. As for bit a Bitmapset of varattnos, that would certainly work well at the base relation level when there are no unique expression indexes, but it's not so simple with join relations when the varattnos only mean something when you know which base relation it comes from. I'm not saying that Lists of Exprs is ideal, but I think trying to optimise some code that does not yet exist is premature. There was some other talk in [1] on how we might make checking if a List contains a given Node. That could be advantageous in a few places in the query planner, and it might be useful for this too. [1] https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com
Re: Index Skip Scan
On Tue, Mar 10, 2020 at 4:32 AM James Coleman wrote: > On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > > Part of the puzzle seems to me to this part of the response: > > > I think the UniqueKeys may need to be changed from using > > EquivalenceClasses to use Exprs instead. > > But I can't say I'm being overly helpful by pointing that out, since I > don't have my head in the code enough to understand how you'd > accomplish that :) > > There was a dedicated thread [1] where David explain his idea very detailed, and you can also check that messages around that message for the context. hope it helps. [1] https://www.postgresql.org/message-id/CAApHDvq7i0%3DO97r4Y1pv68%2BtprVczKsXRsV28rM9H-rVPOfeNQ%40mail.gmail.com
Re: Index Skip Scan
> > > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. > When I try to understand why UniqueKeys needs EquivalenceClasses, see your comments here. I feel that FuncExpr can't be used to as a UniquePath even we can create unique index on f(a) and f->strict == true. The reason is even we know a is not null, f->strict = true. it is still be possible that f(a) == null. unique index allows more than 1 null values. so shall we move further to use varattrno instead of Expr? if so, we can also use a list of Bitmapset to present multi unique path of a single RelOptInfo.
Re: Index Skip Scan
On Wed, 11 Mar 2020 at 01:38, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. This results in the optimisation not > > working in cases like: > > I've read it as "an assumption that an EquivalenceMember can only be a > Var" results in "the optimisation not working in cases like this". But > you've meant that ignoring a ProjectionPath with an IndexPath inside > results in this optimisation not working, right? If so, then everything > is clear, and my apologies, maybe I need to finally fix my sleep > schedule :) Yes, I was complaining that a ProjectionPath breaks the optimisation and I don't believe there's any reason that it should. I believe the way to make that work correctly requires paying attention to the Path's uniquekeys rather than what type of path it is.
Re: Index Skip Scan
> >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote: > On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > > of path type is that in create_distinct_path, then it can also work for > > ProjectionPath or anything else (if UniqueKeys are present). But then > > still EquivalenceMember are used only to figure out correct > > distinctPrefixKeys and do not affect whether or not skipping is applied. > > What do I miss? > > I'm not sure I fully understand the question correctly, but let me > explain further. > > In the 0001 patch, standard_qp_callback sets the query_uniquekeys > depending on the DISTINCT / GROUP BY clause. When building index > paths in build_index_paths(), the 0002 patch should be looking at the > root->query_uniquekeys to see if it can build any index paths that > suit those keys. Such paths should be tagged with the uniquekeys they > satisfy, basically, exactly the same as how pathkeys work. Many > create_*_path functions will need to be modified to carry forward > their uniquekeys. For example, create_projection_path(), > create_limit_path() don't do anything which would cause the created > path to violate the unique keys. This way when you get down to > create_distinct_paths(), paths other than IndexPath may have > uniquekeys. You'll be able to check which existing paths satisfy the > unique keys required by the DISTINCT / GROUP BY and select those paths > instead of having to create any HashAggregate / Unique paths. > > Does that answer the question? Hmm... I'm afraid no, this was already clear. But looks like now I see that I've misinterpreted one part. > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. This results in the optimisation not > working in cases like: I've read it as "an assumption that an EquivalenceMember can only be a Var" results in "the optimisation not working in cases like this". But you've meant that ignoring a ProjectionPath with an IndexPath inside results in this optimisation not working, right? If so, then everything is clear, and my apologies, maybe I need to finally fix my sleep schedule :)
Re: Index Skip Scan
On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? Part of the puzzle seems to me to this part of the response: > I think the UniqueKeys may need to be changed from using > EquivalenceClasses to use Exprs instead. But I can't say I'm being overly helpful by pointing that out, since I don't have my head in the code enough to understand how you'd accomplish that :) James
Re: Index Skip Scan
On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Assuming we'll implement it in a way that we do not know about what kind > of path type is that in create_distinct_path, then it can also work for > ProjectionPath or anything else (if UniqueKeys are present). But then > still EquivalenceMember are used only to figure out correct > distinctPrefixKeys and do not affect whether or not skipping is applied. > What do I miss? I'm not sure I fully understand the question correctly, but let me explain further. In the 0001 patch, standard_qp_callback sets the query_uniquekeys depending on the DISTINCT / GROUP BY clause. When building index paths in build_index_paths(), the 0002 patch should be looking at the root->query_uniquekeys to see if it can build any index paths that suit those keys. Such paths should be tagged with the uniquekeys they satisfy, basically, exactly the same as how pathkeys work. Many create_*_path functions will need to be modified to carry forward their uniquekeys. For example, create_projection_path(), create_limit_path() don't do anything which would cause the created path to violate the unique keys. This way when you get down to create_distinct_paths(), paths other than IndexPath may have uniquekeys. You'll be able to check which existing paths satisfy the unique keys required by the DISTINCT / GROUP BY and select those paths instead of having to create any HashAggregate / Unique paths. Does that answer the question?
Re: Index Skip Scan
> On Mon, Mar 09, 2020 at 10:27:26AM +1300, David Rowley wrote: > > > > I think the changes in create_distinct_paths() need more work. The > > > way I think this should work is that create_distinct_paths() gets to > > > know exactly nothing about what path types support the elimination of > > > duplicate values. The Path should carry the UniqueKeys so that can be > > > determined. In create_distinct_paths() you should just be able to make > > > use of those paths, which should already have been created when > > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > > having been set. > > > > Just for me to clarify. The idea is to "move" information about what > > path types support skipping into UniqueKeys (derived from PlannerInfo's > > query_uniquekeys), but other checks (e.g. if index am supports that) > > still perform in create_distinct_paths? > > create_distinct_paths() shouldn't know any details specific to the > pathtype that it's using or considering using. All the details should > just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be > no IsA(path, ...). Please have a look over the details in my reply to > Tomas. I hope that reply has enough information in it, but please > reply there if I've missed something. Yes, I've read this reply, just wanted to ask here, since I had other questions as well. Speaking of which: > > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > > There's also some weird looking assumptions that an EquivalenceMember > > > can only be a Var in create_distinct_paths(). I think you're only > > > saved from crashing there because a ProjectionPath will be created > > > atop of the IndexPath to evaluate expressions, in which case you're > > > not seeing the IndexPath. I'm probably missing something, so to eliminate any misunderstanding from my side: > > > This results in the optimisation not working in cases like: > > > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > > select distinct a+1 from t; > > > CREATE TABLE > > > CREATE INDEX > > > QUERY PLAN > > > --- > > > HashAggregate (cost=48.25..50.75 rows=200 width=4) > > >Group Key: (a + 1) > > >-> Seq Scan on t (cost=0.00..41.88 rows=2550 width=4) In this particular example skipping is not applied because, as you've mentioned, we're dealing with ProjectionPath (not IndexScan / IndexOnlyScan). Which means we're not even reaching the code with EquivalenceMember, so I'm still not sure how do they connected? Assuming we'll implement it in a way that we do not know about what kind of path type is that in create_distinct_path, then it can also work for ProjectionPath or anything else (if UniqueKeys are present). But then still EquivalenceMember are used only to figure out correct distinctPrefixKeys and do not affect whether or not skipping is applied. What do I miss?
Re: Index Skip Scan
On Mon, 9 Mar 2020 at 03:21, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > > > I've been looking over v32 of the patch and have a few comments > > regarding the planner changes. > > Thanks for the commentaries! > > > I think the changes in create_distinct_paths() need more work. The > > way I think this should work is that create_distinct_paths() gets to > > know exactly nothing about what path types support the elimination of > > duplicate values. The Path should carry the UniqueKeys so that can be > > determined. In create_distinct_paths() you should just be able to make > > use of those paths, which should already have been created when > > creating index paths for the rel due to PlannerInfo's query_uniquekeys > > having been set. > > Just for me to clarify. The idea is to "move" information about what > path types support skipping into UniqueKeys (derived from PlannerInfo's > query_uniquekeys), but other checks (e.g. if index am supports that) > still perform in create_distinct_paths? create_distinct_paths() shouldn't know any details specific to the pathtype that it's using or considering using. All the details should just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be no IsA(path, ...). Please have a look over the details in my reply to Tomas. I hope that reply has enough information in it, but please reply there if I've missed something. > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > There's also some weird looking assumptions that an EquivalenceMember > > can only be a Var in create_distinct_paths(). I think you're only > > saved from crashing there because a ProjectionPath will be created > > atop of the IndexPath to evaluate expressions, in which case you're > > not seeing the IndexPath. This results in the optimisation not > > working in cases like: > > > > postgres=# create table t (a int); create index on t ((a+1)); explain > > select distinct a+1 from t; > > CREATE TABLE > > CREATE INDEX > > QUERY PLAN > > --- > > HashAggregate (cost=48.25..50.75 rows=200 width=4) > >Group Key: (a + 1) > >-> Seq Scan on t (cost=0.00..41.88 rows=2550 width=4) > > Yes, I need to fix it. > > > Using unique paths as I mentioned above should see that fixed. > > I'm a bit confused about this statement, how exactly unique paths should > fix this? The path's uniquekeys would mention that it's unique on (a+1). You'd compare the uniquekeys of the path to the DISTINCT clause and see that the uniquekeys are a subset of the DISTINCT clause therefore the DISTINCT is a no-op. If that uniquekey path is cheaper than the cheapest_total_path + , then you should pick the unique path, otherwise use the cheapest_total_path and uniquify that. I think the UniqueKeys may need to be changed from using EquivalenceClasses to use Exprs instead.
Re: Index Skip Scan
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > > I've been looking over v32 of the patch and have a few comments > regarding the planner changes. Thanks for the commentaries! > I think the changes in create_distinct_paths() need more work. The > way I think this should work is that create_distinct_paths() gets to > know exactly nothing about what path types support the elimination of > duplicate values. The Path should carry the UniqueKeys so that can be > determined. In create_distinct_paths() you should just be able to make > use of those paths, which should already have been created when > creating index paths for the rel due to PlannerInfo's query_uniquekeys > having been set. Just for me to clarify. The idea is to "move" information about what path types support skipping into UniqueKeys (derived from PlannerInfo's query_uniquekeys), but other checks (e.g. if index am supports that) still perform in create_distinct_paths? > Also, should create_grouping_paths() be getting the same code? > Jesper's UniqueKey patch seems to set query_uniquekeys when there's a > GROUP BY with no aggregates. So it looks like he has intended that > something like: > > SELECT x FROM t GROUP BY x; > > should work the same way as > > SELECT DISTINCT x FROM t; > > but the 0002 patch does not make this work. Has that just been overlooked? I believe it wasn't overlooked in 0002 patch, but rather added just in case in 0001. I guess there are no theoretical problems in implementing it, but since we wanted to keep scope of the patch under control and concentrate on the existing functionality it probably makes sense to plan it as one of the next steps? > There's also some weird looking assumptions that an EquivalenceMember > can only be a Var in create_distinct_paths(). I think you're only > saved from crashing there because a ProjectionPath will be created > atop of the IndexPath to evaluate expressions, in which case you're > not seeing the IndexPath. This results in the optimisation not > working in cases like: > > postgres=# create table t (a int); create index on t ((a+1)); explain > select distinct a+1 from t; > CREATE TABLE > CREATE INDEX > QUERY PLAN > --- > HashAggregate (cost=48.25..50.75 rows=200 width=4) >Group Key: (a + 1) >-> Seq Scan on t (cost=0.00..41.88 rows=2550 width=4) Yes, I need to fix it. > Using unique paths as I mentioned above should see that fixed. I'm a bit confused about this statement, how exactly unique paths should fix this?
Re: Index Skip Scan
On Sat, 7 Mar 2020 at 03:49, Tomas Vondra wrote: > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: > >The reason it must be done this way is that when the RelOptInfo that > >we're performing the DISTINCT on is a joinrel, then we're not going to > >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort > >of Join path instead. I understand you're not yet supporting doing > >this optimisation when joins are involved, but it should be coded in > >such a way that it'll work when we do. (It's probably also a separate > >question as to whether we should have this only work when there are no > >joins. I don't think I personally object to it for stage 1, but > >perhaps someone else might think differently.) > > > > I don't follow. Can you elaborate more? > > AFAICS skip-scan is essentially a capability of an (index) AM. I don't > see how we could ever do that for joinrels? We can do that at the scan > level, below a join, but that's what this patch already supports, I > think. When you say "supporting this optimisation" with joins, do you > mean doing skip-scan for join inputs, or on top of the join? The skip index scan Path would still be created at the base rel level, but the join path on the join relation would have one of the sub-paths of the join as an index skip scan. An example query that could make use of this is: SELECT * FROM some_table WHERE a IN(SELECT indexed_col_with_few_distinct_values FROM big_table); In this case, we might want to create a Skip Scan path on big_table using the index on the "indexed_col_with_few_distinct_values", then Hash Join to "some_table". That class of query is likely stage 2 or 3 of this work, but we need to lay foundations that'll support it. As for not having IndexScan paths in joinrels. Yes, of course, but that's exactly why create_distinct_paths() cannot work the way the patch currently codes it. The patch does: + /* + * XXX: In case of index scan quals evaluation happens + * after ExecScanFetch, which means skip results could be + * fitered out. Consider the following query: + * + * select distinct (a, b) a, b, c from t where c < 100; + * + * Skip scan returns one tuple for one distinct set of (a, + * b) with arbitrary one of c, so if the choosed c does + * not match the qual and there is any c that matches the + * qual, we miss that tuple. + */ + if (path->pathtype == T_IndexScan && which will never work for join relations since they'll only have paths for Loop/Merge/Hash type joins. The key here is to determine which skip scan paths we should create when we're building the normal index paths then see if we can make use of those when planning joins. Subsequently, we'll then see if we can make use of the resulting join paths during create_distinct_paths(). Doing it this way will allow us to use skip scans in queries such as: SELECT DISTINCT t1.z FROM t1 INNER JOIN t2 ON t1.a = t2.unique_col; We'll first create the skip scan paths on t1, then when creating the join paths we'll create additional join paths which use the skipscan path. Because t1.unique_col will at most have 1 join partner for each t2 row, then the join path will have the same unique_keys as the skipscan path. That'll allow us to use the join path which has the skip scan on whichever side of the join the t1 relation ends up. All create_distinct_paths() should be doing is looking for paths that are already implicitly unique on the distinct clause and consider using those in a cost-based way. It shouldn't be making such paths itself. > >For storing these new paths with UniqueKeys, I'm not sure exactly if > >we can just add_path() such paths into the RelOptInfo's pathlist. > >What we don't want to do is accidentally make use of paths which > >eliminate duplicate values when we don't want that behaviour. If we > >did store these paths in RelOptInfo->pathlist then we'd need to go and > >modify a bunch of places to ignore such paths. set_cheapest() would > >have to do something special for them too, which makes me think > >pathlist is the incorrect place. Parallel query added > >partial_pathlist, so perhaps we need unique_pathlist to make this > >work. > > > > Hmmm, good point. Do we actually produce incorrect plans with the > current patch, using skip-scan path when we should not? I don't think so. The patch is only creating skip scan paths on the base rel when we discover it's valid to do so. That's not the way it should work though. How the patch currently works would be similar to initially only creating a SeqScan path for a query such as: SELECT * FROM tab ORDER BY a;, but then, during create_ordered_paths() go and create some IndexPath to scan the btree index on tab.a because we suddenly realise that it'll be good to use that for the
Re: Index Skip Scan
On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote: On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. +1 to code this in a generic way, using query_uniquekeys (if possible) The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) I don't follow. Can you elaborate more? AFAICS skip-scan is essentially a capability of an (index) AM. I don't see how we could ever do that for joinrels? We can do that at the scan level, below a join, but that's what this patch already supports, I think. When you say "supporting this optimisation" with joins, do you mean doing skip-scan for join inputs, or on top of the join? For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. Hmmm, good point. Do we actually produce incorrect plans with the current patch, using skip-scan path when we should not? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Index Skip Scan
On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Here is something similar to what I had in mind. (changing to this email address for future emails) Hi, I've been looking over v32 of the patch and have a few comments regarding the planner changes. I think the changes in create_distinct_paths() need more work. The way I think this should work is that create_distinct_paths() gets to know exactly nothing about what path types support the elimination of duplicate values. The Path should carry the UniqueKeys so that can be determined. In create_distinct_paths() you should just be able to make use of those paths, which should already have been created when creating index paths for the rel due to PlannerInfo's query_uniquekeys having been set. The reason it must be done this way is that when the RelOptInfo that we're performing the DISTINCT on is a joinrel, then we're not going to see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort of Join path instead. I understand you're not yet supporting doing this optimisation when joins are involved, but it should be coded in such a way that it'll work when we do. (It's probably also a separate question as to whether we should have this only work when there are no joins. I don't think I personally object to it for stage 1, but perhaps someone else might think differently.) For storing these new paths with UniqueKeys, I'm not sure exactly if we can just add_path() such paths into the RelOptInfo's pathlist. What we don't want to do is accidentally make use of paths which eliminate duplicate values when we don't want that behaviour. If we did store these paths in RelOptInfo->pathlist then we'd need to go and modify a bunch of places to ignore such paths. set_cheapest() would have to do something special for them too, which makes me think pathlist is the incorrect place. Parallel query added partial_pathlist, so perhaps we need unique_pathlist to make this work. Also, should create_grouping_paths() be getting the same code? Jesper's UniqueKey patch seems to set query_uniquekeys when there's a GROUP BY with no aggregates. So it looks like he has intended that something like: SELECT x FROM t GROUP BY x; should work the same way as SELECT DISTINCT x FROM t; but the 0002 patch does not make this work. Has that just been overlooked? There's also some weird looking assumptions that an EquivalenceMember can only be a Var in create_distinct_paths(). I think you're only saved from crashing there because a ProjectionPath will be created atop of the IndexPath to evaluate expressions, in which case you're not seeing the IndexPath. This results in the optimisation not working in cases like: postgres=# create table t (a int); create index on t ((a+1)); explain select distinct a+1 from t; CREATE TABLE CREATE INDEX QUERY PLAN --- HashAggregate (cost=48.25..50.75 rows=200 width=4) Group Key: (a + 1) -> Seq Scan on t (cost=0.00..41.88 rows=2550 width=4) Using unique paths as I mentioned above should see that fixed. David
Re: Index Skip Scan
*distinct_pathkeys; /* distinctClause pathkeys, if any */ @@ -1077,6 +1081,15 @@ typedef struct ParamPathInfo List *ppi_clauses; /* join clauses available from outer rels */ } ParamPathInfo; +/* + * UniqueKey + */ +typedef struct UniqueKey +{ + NodeTag type; + + EquivalenceClass *eq_clause; /* equivalence class */ +} UniqueKey; /* * Type "Path" is used as-is for sequential-scan paths, as well as some other @@ -1106,6 +1119,9 @@ typedef struct ParamPathInfo * * "pathkeys" is a List of PathKey nodes (see above), describing the sort * ordering of the path's output rows. + * + * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above), + * describing the XXX. */ typedef struct Path { @@ -1129,6 +1145,8 @@ typedef struct Path List *pathkeys; /* sort ordering of path's output */ /* pathkeys is a List of PathKey nodes; see above */ + + List *uniquekeys; /* the unique keys, or NIL if none */ } Path; /* Macro for extracting a path's parameterization relids; beware double eval */ diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h index 6126b491bf..006248bfb5 100644 --- a/src/include/nodes/print.h +++ b/src/include/nodes/print.h @@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump); extern void print_rt(const List *rtable); extern void print_expr(const Node *expr, const List *rtable); extern void print_pathkeys(const List *pathkeys, const List *rtable); +extern void print_uniquekeys(const List *uniquekeys, const List *rtable); extern void print_tl(const List *tlist, const List *rtable); extern void print_slot(TupleTableSlot *slot); diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h index e450fe112a..f75ff6f323 100644 --- a/src/include/optimizer/pathnode.h +++ b/src/include/optimizer/pathnode.h @@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root, List *indexorderbys, List *indexorderbycols, List *pathkeys, + List *uniquekeys, ScanDirection indexscandir, bool indexonly, Relids required_outer, diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h index 9ab73bd20c..5b6be383b3 100644 --- a/src/include/optimizer/paths.h +++ b/src/include/optimizer/paths.h @@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root, extern List *make_pathkeys_for_sortclauses(PlannerInfo *root, List *sortclauses, List *tlist); +extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root, + List *sortclauses, + List *tlist); extern void initialize_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo); extern void update_mergeclause_eclasses(PlannerInfo *root, @@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root, extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel, List *live_childrels); +/* + * uniquekey.c + * Utilities for matching and building unique keys + */ +extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses); +extern bool uniquekeys_contained_in(List *keys1, List *keys2); +extern bool has_useful_uniquekeys(PlannerInfo *root); + #endif /* PATHS_H */ -- 2.21.0 >From 1345f11ca221219a57a98552773963b4c3ceeac0 Mon Sep 17 00:00:00 2001 From: jesperpedersen Date: Fri, 15 Nov 2019 09:46:53 -0500 Subject: [PATCH v32 2/2] Index skip scan Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1]) on top of IndexOnlyScan and IndexScan. To make it suitable for both situations when there are small number of distinct values and significant amount of distinct values the following approach is taken - instead of searching from the root for every value we're searching for then first on the current page, and then if not found continue searching from the root. Original patch and design were proposed by Thomas Munro [2], revived and improved by Dmitry Dolgov and Jesper Pedersen. [1] https://wiki.postgresql.org/wiki/Loose_indexscan [2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com Author: Jesper Pedersen, Dmitry Dolgov Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan --- contrib/bloom/blutils.c | 1 + doc/src/sgml/config.sgml | 15 + doc/src/sgml/indexam.sgml | 63 ++ doc/src/sgml/indices.sgml | 23 + src/backend/access/brin/brin.c| 1 + src/backend/access/gin/ginutil.c | 1 + src/backend/access/gist/gist.c| 1 + src/backend/access/hash/hash.c| 1 + src/backend/access/index/indexam.c| 18 + src/backend/access/nbtree/nbtree.c| 13 + src/backend/access/nbtree/nbtsearch.c | 466 +- src/
Re: Index Skip Scan
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote: > The first attached (renamed to .txt not to confuse the cfbots) is a > small patch that makes sure if _bt_readpage is called with the proper > condition as written in its comment, that is, caller must have pinned > and read-locked so->currPos.buf. This patch reveals many instances of > breakage of the contract. Thanks! On top of which patch version one can apply it? I'm asking because I believe I've addressed similar issues in the last version, and the last proposed diff (after resolving some conflicts) breaks tests for me, so not sure if I miss something. At the same time if you and Tomas strongly agree that it actually makes sense to make moving forward/reading backward case work with dead tuples correctly, I'll take a shot and try to teach the code around _bt_skip to do what is required for that. I can merge your changes there and we can see what would be the result.