Testing Index Skip scan

2022-06-28 Thread Alexandre Felipe
I have been interested in a query that returns a batch of results filtered
by a subset of the first column of an index and ordered by the second.

I created a simple (hopefully) reproducible example of the issue, the two
queries describe the same data but have very different costs (explain
output included in the attached file).
server_version 12.8

On slack #pgsql-hackers channel, @sfrost, it was suggested that what I
described is achieved by index skip scan. How can I get a development build
to test this feature?


create table ordered_skip_index_scan_test (id serial not null
constraint ordered_skip_index_scan_test_pkey primary key, col1 INT4, col2 
INT4);

insert into ordered_skip_index_scan_test (col1, col2)
  select trunc(sqrt(random()*1e6)) as col1, (random()*1e6)::int as col2
from generate_series(1, 11e6);

create index ordered_skip_index_col2 on ordered_skip_index_scan_test using 
btree (col2)
create index ordered_skip_index_col1_col2 on ordered_skip_index_scan_test using 
btree (col1, col2)
analyse ordered_skip_index_scan_test;

select count(*) from ordered_skip_index_scan_test;

explain (analyse, buffers)
select * from ordered_skip_index_scan_test
where col2 > 5
  and col1 in (1, 999)
order by col2
limit 10;
-- Limit  (cost=0.43..263.30 rows=10 width=12) (actual time=0.083..5.668 
rows=10 loops=1)
--   Buffers: shared hit=5362
--   ->  Index Scan using ordered_skip_index_col2 on 
ordered_skip_index_scan_test  (cost=0.43..492524.49 rows=18737 width=12) 
(actual time=0.082..5.664 rows=10 loops=1)
-- Index Cond: (col2 > 5)
-- Filter: (col1 = ANY ('{1,999}'::integer[]))
-- Rows Removed by Filter: 5343
-- Buffers: shared hit=5362
-- Planning Time: 0.135 ms
-- Execution Time: 5.693 ms


explain (analyse, buffers)
with t as ((select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 13
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   -- Not taking advantage of the partial
   -- results to adjust the bounds of the
   -- subsequent queries
   where col2 between 5 and 50
 and col1 = 8
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 5
   limit 10)
union (select *
   from ordered_skip_index_scan_test
   where col2 between 5 and 50
 and col1 = 845
   limit 10))
select * from t
order by col2
limit 10;
-- Limit  (cost=159.95..159.97 rows=10 width=12) (actual time=0.225..0.230 
rows=10 loops=1)
--   Buffers: shared hit=52
--   ->  Sort  (cost=159.95..160.05 rows=40 width=12) (actual time=0.224..0.227 
rows=10 loops=1)
-- Sort Key: ordered_skip_index_scan_test.col2
-- Sort Method: top-N heapsort  Memory: 25kB
-- Buffers: shared hit=52
-- ->  HashAggregate  (cost=158.28..158.68 rows=40 width=12) (actual 
time=0.167..0.178 rows=40 loops=1)
--   Group Key: ordered_skip_index_scan_test.id, 
ordered_skip_index_scan_test.col1, ordered_skip_index_scan_test.col2
--   Batches: 1  Memory Usage: 24kB
--   Buffers: shared hit=52
--   ->  Append  (cost=0.43..157.98 rows=40 width=12) (actual 
time=0.026..0.144 rows=40 loops=1)
-- Buffers: shared hit=52
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.026..0.044 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test  (cost=0.43..17214.39 rows=4424 width=12) 
(actual time=0.025..0.042 rows=10 loops=1)
-- Index Cond: ((col1 = 13) AND (col2 >= 5) 
AND (col2 <= 50))
-- Buffers: shared hit=13
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.011..0.028 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test ordered_skip_index_scan_test_1  
(cost=0.43..17214.39 rows=4424 width=12) (actual time=0.011..0.026 rows=10 
loops=1)
-- Index Cond: ((col1 = 8) AND (col2 >= 5) 
AND (col2 <= 50))
-- Buffers: shared hit=13
-- ->  Limit  (cost=0.43..39.35 rows=10 width=12) (actual 
time=0.010..0.025 rows=10 loops=1)
--   Buffers: shared hit=13
--   ->  Index Scan using ordered_skip_index_col1_col2 
on ordered_skip_index_scan_test ordered_skip_index_scan_test_2  
(cost=0.43..17214.39 rows=4424 width=12) (actual time=0.010..0.023 rows=10 
loops=1)
-- Index Cond: ((col1 = 5) AND (col2 >= 5) 
AND (col2 <= 50))

Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Peter Geoghegan
On Mon, Mar 28, 2022 at 7:07 PM Tom Lane  wrote:
> Right, that's the case I had in mind --- apologies if my terminology
> was faulty.  btree can actually handle such a case now, but what it
> fails to do is re-descend from the tree root instead of plowing
> forward in the index to find the next matching entry.

KNNGIST seems vaguely related to what we'd build for nbtree skip scan,
though. GiST index scans are "inherently loose", though. KNNGIST uses
a pairing heap/priority queue, which seems like the kind of thing
nbtree skip scan can avoid.

> +1.  We at least need to be sure we all are using these terms
> the same way.

Yeah, there are *endless* opportunities for confusion here.

-- 
Peter Geoghegan




Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Tom Lane
Peter Geoghegan  writes:
> The terminology in this area is a mess. MySQL calls
> SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
> that you're talking about skip scan when you say "loose index scan".
> Skip scan is where there is an omitted prefix of columns in the SQL
> query -- omitted columns after the first column that lack an equality
> qual.

Right, that's the case I had in mind --- apologies if my terminology
was faulty.  btree can actually handle such a case now, but what it
fails to do is re-descend from the tree root instead of plowing
forward in the index to find the next matching entry.

> It might be useful for somebody to go write a "taxonomy of MDAM
> techniques", or a glossary.

+1.  We at least need to be sure we all are using these terms
the same way.

regards, tom lane




Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Peter Geoghegan
On Mon, Mar 28, 2022 at 5:21 PM Tom Lane  wrote:
> In any case, what I was on about is _bt_preprocess_keys() and
> adjacent code.  I'm surprised that those aren't more expensive
> than one palloc in _bt_first.  Maybe that logic falls through very
> quickly in simple cases, though.

I assume that it doesn't really appear in very simple cases (also
common cases). But delaying the scan setup work until execution time
does seem ugly. That's probably a good enough reason to refactor.

-- 
Peter Geoghegan




Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Peter Geoghegan
On Tue, Mar 22, 2022 at 1:55 PM Tom Lane  wrote:
> Peter asked me off-list to spend some time thinking about the overall
> direction we ought to be pursuing here.

Thanks for taking a look!

"5.5 Exploiting Key Prefixes" and "5.6 Ordered Retrieval" from "Modern
B-Tree Techniques" are also good, BTW.

The terminology in this area is a mess. MySQL calls
SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
that you're talking about skip scan when you say "loose index scan".
Skip scan is where there is an omitted prefix of columns in the SQL
query -- omitted columns after the first column that lack an equality
qual. Pretty sure that MySQL/InnoDB can't do that -- it can only
"skip" to the extent required to make
SELECT-DISTINCT-that-matches-an-index perform well, but that's about
it.

It might be useful for somebody to go write a "taxonomy of MDAM
techniques", or a glossary. The existing "Loose indexscan" Postgres
wiki page doesn't seem like enough. Something very high level and
explicit, with examples, just so we don't end up talking at cross
purposes too much.

> 1. Usually I'm in favor of doing this sort of thing in an index AM
> agnostic way, but here I don't see much point.  All of the ideas at
> stake rely fundamentally on having a lexicographically-ordered multi
> column index; but we don't have any of those except btree, nor do
> I think we're likely to get any soon.  This motivates the general
> tenor of my remarks below, which is "do it in access/nbtree/ not in
> the planner".

That was my intuition all along, but I didn't quite have the courage
to say so -- sounds too much like something that an optimizer
dilettante like me would be expected to say.  :-)

Seems like one of those things where lots of high level details
intrinsically need to be figured out on-the-fly, at execution time,
rather than during planning. Perhaps it'll be easier to correctly
determine that a skip scan plan is the cheapest in practice than to
accurately cost skip scan plans. If the only alternative is a
sequential scan, then perhaps a very approximate cost model will work
well enough. It's probably way too early to tell right now, though.

> I've worried in the past that we'd
> need planner/statistical support to figure out whether a loose scan
> is likely to be useful compared to just plowing ahead in the index.

I don't expect to be able to come up with a structure that leaves no
unanswered questions about future MDAM work -- it's not realistic to
expect everything to just fall into place. But that's okay. Just
having everybody agree on roughly the right conceptual model is the
really important thing. That now seems quite close, which I count as
real progress.

> 4. I find each of the above ideas to be far more attractive than
> optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
> understand why the current patchsets seem to be driven largely
> by that single use-case.  I wouldn't even bother with that for the
> initial patch.

I absolutely agree. I wondered about that myself in the past. My best
guess is that a certain segment of users are familiar with
SELECT-DISTINCT-that-matches-an-index from MySQL. And so to some
extent application frameworks evolved in a world where that capability
existed. IIRC Jesper once said that Hibernate relied on this
capability.

It's probably a lot easier to implement
SELECT-DISTINCT-that-matches-an-index if you have the MySQL storage
engine model, with concurrency control that's typically based on
two-phase locking. I think that MySQL does some amount of
deduplication in its executor here -- and *not* in what they call the storage
engine. This is exactly what I'd like to avoid in Postgres; as I said
"Maintenance of Index Order" (as the paper calls it) seems important,
and not something to be added later on. Optimizer and executor layers
that each barely know the difference between a skip scan and a full
index scan seems like something we might actually want to aim for,
rather than avoid. Teaching nbtree to transform quals into ranges
sounds odd at first, but it seems like the right approach now, on
balance -- that's the only *good* way to maintain index order.
(Maintaining index order is needed to avoid needing or relying on
deduplication in the executor proper, which is even inappropriate in
an implementation of SELECT-DISTINCT-that-matches-an-index IMO.)

--
Peter Geoghegan




Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Tom Lane
Peter Geoghegan  writes:
> We could get rid of dynamic allocations for BTStackData in
> _bt_first(), perhaps. The problem is that there is no simple,
> reasonable proof of the maximum height on a B-tree, even though a
> B-Tree with more than 7 or 8 levels seems extraordinarily unlikely.

Start with a few entries preallocated, and switch to dynamically
allocated space if there turn out to be more levels than that,
perhaps?  Not sure if it's worth the trouble.

In any case, what I was on about is _bt_preprocess_keys() and
adjacent code.  I'm surprised that those aren't more expensive
than one palloc in _bt_first.  Maybe that logic falls through very
quickly in simple cases, though.

regards, tom lane




Re: MDAM techniques and Index Skip Scan patch

2022-03-28 Thread Peter Geoghegan
On Tue, Mar 22, 2022 at 4:06 PM Andres Freund  wrote:
> Are you thinking of just moving the setup stuff in nbtree (presumably parts of
> _bt_first() / _bt_preprocess_keys()) or also stuff in
> ExecIndexBuildScanKeys()?
>
> The latter does show up a bit more heavily in profiles than nbtree specific
> setup, and given that it's generic executor type stuff, seems even more
> amenable to being moved to plan time.

When I was working on the patch series that became the nbtree Postgres
12 work, this came up. At one point I discovered that using palloc0()
for the insertion scankey in _bt_first() was a big problem with nested
loop joins -- it became a really noticeable bottleneck with one of my
test cases. I independently discovered what Tom must have figured out
back in 2005, when he committed d961a56899. This led to my making the
new insertion scan key structure (BTScanInsertData) not use dynamic
allocation. So _bt_first() is definitely performance critical for
certain types of queries.

We could get rid of dynamic allocations for BTStackData in
_bt_first(), perhaps. The problem is that there is no simple,
reasonable proof of the maximum height on a B-tree, even though a
B-Tree with more than 7 or 8 levels seems extraordinarily unlikely.
You could also invent a slow path (maybe do what we do in
_bt_insert_parent() in the event of a concurrent root page split/NULL
stack), but that runs into the problem of being awkward to test, and
pretty ugly. It's doable, but I wouldn't do it unless there was a
pretty noticeable payoff.

-- 
Peter Geoghegan




Re: Index Skip Scan (new UniqueKeys)

2022-03-24 Thread Jesper Pedersen

On 6/9/20 06:22, Dmitry Dolgov wrote:

Here is a new version of index skip scan patch, based on v8 patch for
UniqueKeys implementation from [1]. I want to start a new thread to
simplify navigation, hopefully I didn't forget anyone who actively
participated in the discussion.



This CommitFest entry has been closed with RwF at [1].

Thanks for all the feedback given !

[1] 
https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com


Best regards,
 Dmitry & Jesper





Re: MDAM techniques and Index Skip Scan patch

2022-03-24 Thread Jesper Pedersen

On 3/23/22 18:22, Dmitry Dolgov wrote:


The CF item could be set to RwF, what would you say, Jesper?



We want to thank the community for the feedback that we have received 
over the years for this feature. Hopefully a future implementation can 
use Tom's suggestions to get closer to a committable solution.


Here is the last CommitFest entry [1] for the archives.

RwF

[1] https://commitfest.postgresql.org/37/1741/

Best regards,
 Dmitry & Jesper





Re: MDAM techniques and Index Skip Scan patch

2022-03-23 Thread Dmitry Dolgov
> On Wed, Mar 23, 2022 at 05:32:46PM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthali...@gmail.com> writes:
> > On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
> >> In short: I would throw out just about all the planner infrastructure
> >> that's been proposed so far.  It looks bulky, expensive, and
> >> drastically undercommented, and I don't think it's buying us anything
> >> of commensurate value.
>
> > Broadly speaking planner related changes proposed in the patch so far
> > are: UniqueKey, taken from the neighbour thread about select distinct;
> > list of uniquekeys to actually pass information about the specified
> > loose scan prefix into nbtree; some verification logic to prevent
> > applying skipping when it's not supported. I can imagine taking out
> > UniqueKeys and passing loose scan prefix in some other form (the other
> > parts seems to be essential) -- is that what you mean?
>
> My point is that for pure loose scans --- that is, just optimizing a scan,
> not doing AM-based duplicate-row-elimination --- you do not need to pass
> any new data to btree at all.  It can infer what to do on the basis of the
> set of index quals it's handed.
>
> The bigger picture here is that I think the reason this patch series has
> failed to progress is that it's too scattershot.  You need to pick a
> minimum committable feature and get that done, and then you can move on
> to the next part.  I think the minimum committable feature is loose scans,
> which will require a fair amount of work in access/nbtree/ but very little
> new planner code, and will be highly useful in their own right even if we
> never do anything more.
>
> In general I feel that the UniqueKey code is a solution looking for a
> problem, and that treating it as the core of the patchset is a mistake.
> We should be driving this work off of what nbtree needs to make progress,
> and not building more infrastructure elsewhere than we have to.  Maybe
> we'll end up with something that looks like UniqueKeys, but I'm far from
> convinced of that.

I see. I'll need some thinking time about how it may look like (will
probably return with more questions).

The CF item could be set to RwF, what would you say, Jesper?




Re: MDAM techniques and Index Skip Scan patch

2022-03-23 Thread Tom Lane
Dmitry Dolgov <9erthali...@gmail.com> writes:
> On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
>> In short: I would throw out just about all the planner infrastructure
>> that's been proposed so far.  It looks bulky, expensive, and
>> drastically undercommented, and I don't think it's buying us anything
>> of commensurate value.

> Broadly speaking planner related changes proposed in the patch so far
> are: UniqueKey, taken from the neighbour thread about select distinct;
> list of uniquekeys to actually pass information about the specified
> loose scan prefix into nbtree; some verification logic to prevent
> applying skipping when it's not supported. I can imagine taking out
> UniqueKeys and passing loose scan prefix in some other form (the other
> parts seems to be essential) -- is that what you mean?

My point is that for pure loose scans --- that is, just optimizing a scan,
not doing AM-based duplicate-row-elimination --- you do not need to pass
any new data to btree at all.  It can infer what to do on the basis of the
set of index quals it's handed.

The bigger picture here is that I think the reason this patch series has
failed to progress is that it's too scattershot.  You need to pick a
minimum committable feature and get that done, and then you can move on
to the next part.  I think the minimum committable feature is loose scans,
which will require a fair amount of work in access/nbtree/ but very little
new planner code, and will be highly useful in their own right even if we
never do anything more.

In general I feel that the UniqueKey code is a solution looking for a
problem, and that treating it as the core of the patchset is a mistake.
We should be driving this work off of what nbtree needs to make progress,
and not building more infrastructure elsewhere than we have to.  Maybe
we'll end up with something that looks like UniqueKeys, but I'm far from
convinced of that.

regards, tom lane




Re: MDAM techniques and Index Skip Scan patch

2022-03-23 Thread Dmitry Dolgov
> On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
> Peter Geoghegan  writes:
> > Like many difficult patches, the skip scan patch is not so much
> > troubled by problems with the implementation as it is troubled by
> > *ambiguity* about the design. Particularly concerning how skip scan
> > meshes with existing designs, as well as future designs --
> > particularly designs for other MDAM techniques. I've started this
> > thread to have a big picture conversation about how to think about
> > these things.
>
> Peter asked me off-list to spend some time thinking about the overall
> direction we ought to be pursuing here.  I have done that, and here
> are a few modest suggestions.

Thanks. To make sure I understand your proposal better, I have a couple
of questions:

> In short: I would throw out just about all the planner infrastructure
> that's been proposed so far.  It looks bulky, expensive, and
> drastically undercommented, and I don't think it's buying us anything
> of commensurate value.

Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seems to be essential) -- is that what you mean?




Re: MDAM techniques and Index Skip Scan patch

2022-03-22 Thread Tom Lane
Andres Freund  writes:
> On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
>> BTW, I've had a bee in my bonnet for a long time about whether some of
>> nbtree's scan setup work could be done once during planning, rather than
>> over again during each indexscan start.

> It does show up in simple-index-lookup heavy workloads. Not as a major thing,
> but it's there. And it's just architecturally displeasing :)
> Are you thinking of just moving the setup stuff in nbtree (presumably parts of
> _bt_first() / _bt_preprocess_keys()) or also stuff in
> ExecIndexBuildScanKeys()?

Didn't really have specifics in mind.  The key stumbling block is
that some (not all) of the work depends on knowing the specific
values of the indexqual comparison keys, so while you could do
that work in advance for constant keys, you'd still have to be
prepared to do work at scan start for non-constant keys.  I don't
have a clear idea about how to factorize that effectively.

A couple of other random ideas in this space:

* I suspect that a lot of this work overlaps with the efforts that
btcostestimate makes along the way to getting a cost estimate.
So it's interesting to wonder whether we could refactor so that
btcostestimate is integrated with this hypothetical plan-time key
preprocessing and doesn't duplicate work.

* I think that we run through most or all of that preprocessing
logic even for internal catalog accesses, where we know darn well
how the keys are set up.  We ought to think harder about how we
could short-circuit pointless work in those code paths.

I don't think any of this is an essential prerequisite to getting
something done for loose index scans, which ISTM ought to be the first
point of attack for v16.  Loose index scans per se shouldn't add much
to the key preprocessing costs.  But these ideas likely would be
useful to look into before anyone starts on the more complicated
preprocessing that would be needed for the ideas in the MDAM paper.

regards, tom lane




Re: MDAM techniques and Index Skip Scan patch

2022-03-22 Thread Andres Freund
Hi,

On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
> 4. I find each of the above ideas to be far more attractive than
> optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
> understand why the current patchsets seem to be driven largely
> by that single use-case.

It's something causing plenty pain in production environments... Obviously
it'd be even better if the optimization also triggered in cases like
  SELECT some_indexed_col FROM blarg GROUP BY some_indexed_col
which seems to be what ORMs like to generate.


> BTW, I've had a bee in my bonnet for a long time about whether some of
> nbtree's scan setup work could be done once during planning, rather than
> over again during each indexscan start.

It does show up in simple-index-lookup heavy workloads. Not as a major thing,
but it's there. And it's just architecturally displeasing :)

Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?

The latter does show up a bit more heavily in profiles than nbtree specific
setup, and given that it's generic executor type stuff, seems even more
amenable to being moved to plan time.

Greetings,

Andres Freund




Re: MDAM techniques and Index Skip Scan patch

2022-03-22 Thread Tom Lane
Peter Geoghegan  writes:
> Like many difficult patches, the skip scan patch is not so much
> troubled by problems with the implementation as it is troubled by
> *ambiguity* about the design. Particularly concerning how skip scan
> meshes with existing designs, as well as future designs --
> particularly designs for other MDAM techniques. I've started this
> thread to have a big picture conversation about how to think about
> these things.

Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here.  I have done that, and here
are a few modest suggestions.

1. Usually I'm in favor of doing this sort of thing in an index AM
agnostic way, but here I don't see much point.  All of the ideas at
stake rely fundamentally on having a lexicographically-ordered multi
column index; but we don't have any of those except btree, nor do
I think we're likely to get any soon.  This motivates the general
tenor of my remarks below, which is "do it in access/nbtree/ not in
the planner".

2. The MDAM paper Peter cited is really interesting.  You can see
fragments of those ideas in our existing btree code, particularly in
the scan setup stuff that detects redundant or contradictory keys and
determines a scan start strategy.  The special handling we implemented
awhile ago for ScalarArrayOp index quals is also a subset of what they
were talking about.  It seems to me that if we wanted to implement more
of those ideas, the relevant work should almost all be done in nbtree
proper.  The planner would need only minor adjustments: btcostestimate
would have to be fixed to understand the improvements, and there are
some rules in indxpath.c that prevent us from passing "too complicated"
sets of indexquals to the AM, which would need to be relaxed or removed
altogether.

3. "Loose" indexscan (i.e., sometimes re-descend from the tree root
to find the next index entry) is again something that seems like it's
mainly nbtree's internal problem.  Loose scan is interesting if we
have index quals for columns that are after the first column that lacks
an equality qual, otherwise not.  I've worried in the past that we'd
need planner/statistical support to figure out whether a loose scan
is likely to be useful compared to just plowing ahead in the index.
However, that seems to be rendered moot by the idea used in the current
patchsets, ie scan till we find that we'll have to step off the current
page, and re-descend at that point.  (When and if we find that that
heuristic is inadequate, we could work on passing some statistical data
forward.  But we don't need any in the v1 patch.)  Again, we need some
work in btcostestimate to understand how the index scan cost will be
affected, but I still don't see any pressing need for major planner
changes or plan tree contents changes.

4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case.  I wouldn't even bother with that for the
initial patch.  When we do get around to it, it still doesn't need
major planner support, I think --- again fixing the cost estimation
is the bulk of the work.  Munro's original 2014 patch showed that we
don't really need all that much to get the planner to build such a
plan; the problem is to convince it that that plan will be cheap.

In short: I would throw out just about all the planner infrastructure
that's been proposed so far.  It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.  The part of the planner that actually needs
serious thought is btcostestimate, which has been woefully neglected in
both of the current patchsets.

BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start.  This issue might become more
pressing if the work becomes significantly more complicated/expensive,
which these ideas might cause.  But it's a refinement that could be
left for later --- and in any case, the responsibility would still
fundamentally be nbtree's.  I don't think the planner would do more
than call some AM routine that could add decoration to an IndexScan
plan node.

Now ... where did I put my flameproof vest?

regards, tom lane




Re: MDAM techniques and Index Skip Scan patch

2022-03-22 Thread Thomas Munro
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund  wrote:
> IMO it's pretty clear that having "duelling" patches below one CF entry is a
> bad idea. I think they should be split, with inactive approaches marked as
> returned with feeback or whatnot.

I have the impression that this thread is getting some value from
having a CF entry, as a multi-person collaboration where people are
trading ideas and also making progress that no one wants to mark as
returned, but it's vexing for people managing the CF because it's not
really proposed for 15.  Perhaps what we lack is a new status, "Work
In Progress" or something?




Re: MDAM techniques and Index Skip Scan patch

2022-03-22 Thread Dmitry Dolgov
> On Mon, Mar 21, 2022 at 06:34:09PM -0700, Andres Freund wrote:
>
> > I don't mind having all of the alternatives under the same CF item, only
> > one being tested seems to be only a small limitation of cfbot.
>
> IMO it's pretty clear that having "duelling" patches below one CF entry is a
> bad idea. I think they should be split, with inactive approaches marked as
> returned with feeback or whatnot.

On the other hand even for patches with dependencies (i.e. the patch A
depends on the patch B) different CF items cause a lot of confusion for
reviewers. I guess for various flavours of the same patch it would be
even worse. But I don't have a strong opinion here.

> Either way, currently this patch fails on cfbot due to a new GUC:
> https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
> https://cirrus-ci.com/task/5134905372835840

This seems to be easy to solve.
>From 5bae9fdf8b74e5996b606e78f8b2a5fb327e011b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v41 1/6] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query
level, and allows to leverage this into on the path level. At the moment
only distinctClause makes use of such mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan, contains few bits out of previous version
from Jesper Pedersen, Floris Van Nee.
---
 src/backend/nodes/list.c| 31 +
 src/backend/optimizer/path/Makefile |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c| 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h   |  5 ++
 src/include/nodes/pg_list.h |  1 +
 src/include/optimizer/pathnode.h|  1 +
 src/include/optimizer/paths.h   |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell	*lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool found = false;
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+found = true;
+break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 86a35cdef1..e2be1fbf90 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"
 
 
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 			 RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }
 
+/*
+ * pathkey_is_unique
+ *		Checks if the new pathkey's equivalence class is the same as that of
+ *		any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+	EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+	ListCell   *lc;
+
+	/* If same EC already is already in the list, then not unique */
+	foreach(lc, pathkeys)
+	{
+		PathKey*old_pathkey = (PathKey *) lfirst(lc);
+
+		if (new_ec == old_pathkey->pk_eclass)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * pathkey_is_redundant
  *	   Is a pathkey redundant with one already in the given list?
@@ -1152,6 +1176,44 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
 	return pathkeys;
 }
 
+/*
+ * make_pathkeys_for_uniquekeyclauses
+ *		Generate a pathkeys list to be used for uniquekey clauses
+ */
+Lis

Re: MDAM techniques and Index Skip Scan patch

2022-03-21 Thread Andres Freund
Hi,

On 2022-01-22 22:37:19 +0100, Dmitry Dolgov wrote:
> > On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
> > Hi,
> >
> > On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
> > >
> > > FYI, I've attached this thread to the CF item as an informational one,
> > > but as there are some patches posted here, folks may get confused. For
> > > those who have landed here with no context, I feel obliged to mention
> > > that now there are two alternative patch series posted under the same
> > > CF item:
> > >
> > > * the original one lives in [1], waiting for reviews since the last May
> > > * an alternative one posted here from Floris
> >
> > Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
> > It's nice to have the alternative approach threads linkied in the commit 
> > fest,
> > but it seems that the cfbot will use the most recent attachments as the only
> > patchset, thus leaving the "original" one untested.
> >
> > I'm not sure of what's the best approach in such situation.  Maybe creating 
> > a
> > different CF entry for each alternative, and link the other cf entry on the 
> > cf
> > app using the "Add annotations" or "Links" feature rather than attaching
> > threads?
> 
> I don't mind having all of the alternatives under the same CF item, only
> one being tested seems to be only a small limitation of cfbot.

IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feeback or whatnot.

Either way, currently this patch fails on cfbot due to a new GUC:
https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
https://cirrus-ci.com/task/5134905372835840

Greetings,

Andres Freund




Re: Index Skip Scan (new UniqueKeys)

2022-01-24 Thread Zhihong Yu
On Mon, Jan 24, 2022 at 8:51 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote:
> > On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com>
> wrote:
> > > Besides that the new patch version contains some cleaning up and
> > > addresses commentaries around leaf page pinning from [1]. The idea
> > > behind the series structure is still the same: the first three patches
> > > contains the essence of the implementation (hoping to help concentrate
> > > review), the rest are more "experimental".
> > >
> > > [1]:
> > >
> https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com
> >
> > Hi,
> >
> > +   /* If same EC already is already in the list, then not unique */
> >
> > The word already is duplicated.
> >
> > + * make_pathkeys_for_uniquekeyclauses
> >
> > The func name in the comment is different from the actual func name.
>
> Thanks for the review! Right, both above make sense. I'll wait a bit if
> there will be more commentaries, and then post a new version with all
> changes at once.
>
> > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> >
> > The year should be 2022 :-)
>
> Now you see how old is this patch :)
>
> > make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should
> > make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?
>
> It's actually placed there by analogy with make_pathkeys_for_sortclauses
> (immediately preceding function), so I think moving it into uniquekeys
> will only make more confusion.
>
> > +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
> > +bool allow_multinulls)
> >
> > It seems allow_multinulls is not checked in the func. Can the parameter
> be
> > removed ?
>
> Right, it could be removed. I believe it was somewhat important when the
> patch was tightly coupled with the UniqueKeys patch, where it was put in
> use.
>
> Hi,
It would be nice to take out this unused parameter for this patch.

The parameter should be added in patch series where it is used.

Cheers


Re: Index Skip Scan (new UniqueKeys)

2022-01-24 Thread Dmitry Dolgov
> On Sun, Jan 23, 2022 at 04:25:04PM -0800, Zhihong Yu wrote:
> On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Besides that the new patch version contains some cleaning up and
> > addresses commentaries around leaf page pinning from [1]. The idea
> > behind the series structure is still the same: the first three patches
> > contains the essence of the implementation (hoping to help concentrate
> > review), the rest are more "experimental".
> >
> > [1]:
> > https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com
>
> Hi,
>
> +   /* If same EC already is already in the list, then not unique */
>
> The word already is duplicated.
>
> + * make_pathkeys_for_uniquekeyclauses
>
> The func name in the comment is different from the actual func name.

Thanks for the review! Right, both above make sense. I'll wait a bit if
there will be more commentaries, and then post a new version with all
changes at once.

> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> The year should be 2022 :-)

Now you see how old is this patch :)

> make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should
> make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?

It's actually placed there by analogy with make_pathkeys_for_sortclauses
(immediately preceding function), so I think moving it into uniquekeys
will only make more confusion.

> +query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
> +bool allow_multinulls)
>
> It seems allow_multinulls is not checked in the func. Can the parameter be
> removed ?

Right, it could be removed. I believe it was somewhat important when the
patch was tightly coupled with the UniqueKeys patch, where it was put in
use.

> +   Path   *newpath;
> +
> +   newpath = (Path *) create_projection_path(root, rel, subpath,
> + scanjoin_target);
>
> You can remove variable newpath and assign to lfirst(lc) directly.

Yes, but I've followed the same style for create_projection_path as in
many other invocations of this function in planner.c -- I would prefer
to keep it uniform.

>  +add_path(RelOptInfo *parent_rel, Path *new_path)
> +add_unique_path(RelOptInfo *parent_rel, Path *new_path)
>
> It seems the above two func's can be combined into one func which
> takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter.

Sure, but here I've intentionally split it into separate functions,
otherwise a lot of not relevant call sites have to be updated to provide
the third parameter.




Re: Index Skip Scan (new UniqueKeys)

2022-01-23 Thread Zhihong Yu
On Sat, Jan 22, 2022 at 1:32 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote:
> > Hi,
> >
> > Here is another take on the patch with a couple of changes:
> >
> > * I've removed for now UniqueKeys parts. The interaction of skip scan &
> >   unique keys patch was actually not that big, so the main difference is
> >   that now the structure itself went away, a list of unique expressions
> >   is used instead. All the suggestions about how this feature should
> >   look like from the planning perspective are still there. On the one
> >   hand it will allow to develop both patches independently and avoid
> >   confusion for reviewers, on the other UniqueKeys could be easily
> >   incorporated back when needed.
> >
> > * Support for skipping in case of moving backward on demand (scroll
> >   cursor) is moved into a separate patch. This is implemented via
> >   returning false from IndexSupportsBackwardScan in case if it's a skip
> >   scan node, which in turn adds Materialize node on top when needed. The
> >   name SupportsBackwardScan was a bit confusing for me, but it seems
> >   it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL.
> >   Eventually those cases when BackwardScanDirection is used are still
> >   handled by amskip. This change didn't affect the test coverage, all
> >   the test cases supported in previous patch versions are still there.
> >
> >   About Materialize node, I guess it could be less performant than a
> >   "native" support, but it simplifies the implementation significantly
> >   to the point that most parts, which were causing questions before, are
> >   now located in the isolated patch. My idea here is to concentrate
> >   efforts on the first three patches in this series, and consider the
> >   rest of them as an experiment field.
> >
> > * IndexScan support was also relocated into a separate patch, the first
> >   three patches are now only about IndexOnlyScan.
> >
> > * Last bits of reviews were incorporated and rebased.
>
> While the patch is still waiting for a review, I was motivated by the
> thread [1] to think about it from the interface point of view. Consider
> an index skip scan being just like a normal index scan with a set of
> underspecified leading search keys. It makes sense to have the same
> structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm
> not sure if amskip is a good name to represent that, don't have anything
> better yet). But the "underspecified" part is currently indeed
> interpreted in a limited way -- as "missing" keys -- and is expressed
> only via the prefix size. Another option would be e.g. leading keys
> constrained by a range of values, so generally speaking it makes sense
> to extend amount of the information provided for skipping.
>
> As a naive approach I've added a new patch into the series, containing
> the extra data structure (ScanLooseKeys, doesn't have much meaning yet
> except somehow representing keys for skipping) used for index skip scan.
> Any thoughts about it?
>
> Besides that the new patch version contains some cleaning up and
> addresses commentaries around leaf page pinning from [1]. The idea
> behind the series structure is still the same: the first three patches
> contains the essence of the implementation (hoping to help concentrate
> review), the rest are more "experimental".
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com

Hi,

+   /* If same EC already is already in the list, then not unique */

The word already is duplicated.

+ * make_pathkeys_for_uniquekeyclauses

The func name in the comment is different from the actual func name.

+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group

The year should be 2022 :-)

make_pathkeys_for_uniquekeys() is called by build_uniquekeys(). Should
make_pathkeys_for_uniquekeys() be moved into uniquekeys.c ?

+query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
+bool allow_multinulls)

It seems allow_multinulls is not checked in the func. Can the parameter be
removed ?

+   Path   *newpath;
+
+   newpath = (Path *) create_projection_path(root, rel, subpath,
+ scanjoin_target);

You can remove variable newpath and assign to lfirst(lc) directly.

 +add_path(RelOptInfo *parent_rel, Path *new_path)
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)

It seems the above two func's can be combined into one func which
takes parent_rel->pathlist / parent_rel->unique_pathlist as third parameter.

Cheers


Re: MDAM techniques and Index Skip Scan patch

2022-01-22 Thread Dmitry Dolgov
> On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
> Hi,
>
> On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
> >
> > FYI, I've attached this thread to the CF item as an informational one,
> > but as there are some patches posted here, folks may get confused. For
> > those who have landed here with no context, I feel obliged to mention
> > that now there are two alternative patch series posted under the same
> > CF item:
> >
> > * the original one lives in [1], waiting for reviews since the last May
> > * an alternative one posted here from Floris
>
> Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
> It's nice to have the alternative approach threads linkied in the commit fest,
> but it seems that the cfbot will use the most recent attachments as the only
> patchset, thus leaving the "original" one untested.
>
> I'm not sure of what's the best approach in such situation.  Maybe creating a
> different CF entry for each alternative, and link the other cf entry on the cf
> app using the "Add annotations" or "Links" feature rather than attaching
> threads?

I don't mind having all of the alternatives under the same CF item, only
one being tested seems to be only a small limitation of cfbot.




Re: Index Skip Scan (new UniqueKeys)

2022-01-22 Thread Dmitry Dolgov
> On Fri, May 21, 2021 at 05:31:38PM +0200, Dmitry Dolgov wrote:
> Hi,
>
> Here is another take on the patch with a couple of changes:
>
> * I've removed for now UniqueKeys parts. The interaction of skip scan &
>   unique keys patch was actually not that big, so the main difference is
>   that now the structure itself went away, a list of unique expressions
>   is used instead. All the suggestions about how this feature should
>   look like from the planning perspective are still there. On the one
>   hand it will allow to develop both patches independently and avoid
>   confusion for reviewers, on the other UniqueKeys could be easily
>   incorporated back when needed.
>
> * Support for skipping in case of moving backward on demand (scroll
>   cursor) is moved into a separate patch. This is implemented via
>   returning false from IndexSupportsBackwardScan in case if it's a skip
>   scan node, which in turn adds Materialize node on top when needed. The
>   name SupportsBackwardScan was a bit confusing for me, but it seems
>   it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL.
>   Eventually those cases when BackwardScanDirection is used are still
>   handled by amskip. This change didn't affect the test coverage, all
>   the test cases supported in previous patch versions are still there.
>
>   About Materialize node, I guess it could be less performant than a
>   "native" support, but it simplifies the implementation significantly
>   to the point that most parts, which were causing questions before, are
>   now located in the isolated patch. My idea here is to concentrate
>   efforts on the first three patches in this series, and consider the
>   rest of them as an experiment field.
>
> * IndexScan support was also relocated into a separate patch, the first
>   three patches are now only about IndexOnlyScan.
>
> * Last bits of reviews were incorporated and rebased.

While the patch is still waiting for a review, I was motivated by the
thread [1] to think about it from the interface point of view. Consider
an index skip scan being just like a normal index scan with a set of
underspecified leading search keys. It makes sense to have the same
structure "begin scan" -> "get the next tuple" -> "end scan" (now I'm
not sure if amskip is a good name to represent that, don't have anything
better yet). But the "underspecified" part is currently indeed
interpreted in a limited way -- as "missing" keys -- and is expressed
only via the prefix size. Another option would be e.g. leading keys
constrained by a range of values, so generally speaking it makes sense
to extend amount of the information provided for skipping.

As a naive approach I've added a new patch into the series, containing
the extra data structure (ScanLooseKeys, doesn't have much meaning yet
except somehow representing keys for skipping) used for index skip scan.
Any thoughts about it?

Besides that the new patch version contains some cleaning up and
addresses commentaries around leaf page pinning from [1]. The idea
behind the series structure is still the same: the first three patches
contains the essence of the implementation (hoping to help concentrate
review), the rest are more "experimental".

[1]: 
https://www.postgresql.org/message-id/flat/CAH2-WzmUscvoxVkokHxP=uptdjsi0tjkfpupd-cea35dvn-...@mail.gmail.com
>From 8b61823e523ceecbb2fd6faa1a229038bb981594 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v40 1/6] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query
level, and allows to leverage this into on the path level. At the moment
only distinctClause makes use of such mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan, contains few bits out of previous version
from Jesper Pedersen, Floris Van Nee.
---
 src/backend/nodes/list.c| 31 +
 src/backend/optimizer/path/Makefile |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c| 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h   |  5 ++
 src/include/nodes/pg_list.h |  1 +
 src/include/optimizer/pathnode.h|  1 +
 src/include/optimizer/paths.h   |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
---

Re: MDAM techniques and Index Skip Scan patch

2022-01-14 Thread Julien Rouhaud
Hi,

On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
> 
> FYI, I've attached this thread to the CF item as an informational one,
> but as there are some patches posted here, folks may get confused. For
> those who have landed here with no context, I feel obliged to mention
> that now there are two alternative patch series posted under the same
> CF item:
> 
> * the original one lives in [1], waiting for reviews since the last May
> * an alternative one posted here from Floris

Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
It's nice to have the alternative approach threads linkied in the commit fest,
but it seems that the cfbot will use the most recent attachments as the only
patchset, thus leaving the "original" one untested.

I'm not sure of what's the best approach in such situation.  Maybe creating a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?




Re: MDAM techniques and Index Skip Scan patch

2022-01-13 Thread Dmitry Dolgov
> On Thu, Jan 13, 2022 at 03:27:08PM +, Floris Van Nee wrote:
> >
> > Could you send a rebased version?  In the meantime I will change the status
> > on the cf app to Waiting on Author.
>
> Attached a rebased version.

FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:

* the original one lives in [1], waiting for reviews since the last May
* an alternative one posted here from Floris

[1]: 
https://www.postgresql.org/message-id/flat/20200609102247.jdlatmfyeecg52fi@localhost




Re: MDAM techniques and Index Skip Scan patch

2022-01-13 Thread Julien Rouhaud
Hi,

On Sat, Oct 23, 2021 at 07:30:47PM +, Floris Van Nee wrote:
> 
> From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series,
> which focuses on the planner. v2-0001 is Dmitry's patch that extends it to
> make it possible to use UniqueKeys for the skip scan. Both of these are
> unfortunately still older patches, but because they are planner machinery
> they are less relevant to the discussion about the executor here.  Patch
> v2-0002 contains most of my work and introduces all the executor logic for
> the skip scan and hooks up the planner for DISTINCT queries to use the skip
> scan.  Patch v2-0003 is a planner hack that makes the planner pick a skip
> scan on virtually every possibility. This also enables the skip scan on the
> queries above that don't have a DISTINCT but where it could be useful.

The patchset doesn't apply anymore:
http://cfbot.cputube.org/patch_36_1741.log
=== Applying patches on top of PostgreSQL commit ID 
a18b6d2dc288dfa6e7905ede1d4462edd6a8af47 ===
=== applying patch ./v2-0001-Extend-UniqueKeys.patch
[...]
patching file src/include/optimizer/paths.h
Hunk #2 FAILED at 299.
1 out of 2 hunks FAILED -- saving rejects to file 
src/include/optimizer/paths.h.rej

Could you send a rebased version?  In the meantime I will change the status on
the cf app to Waiting on Author.




Re: MDAM techniques and Index Skip Scan patch

2021-10-25 Thread Jesper Pedersen

Hi Peter,

On 10/21/21 22:16, Peter Geoghegan wrote:

I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.



Thanks for starting this thread !

The Index Skip Scan patch could affect a lot of areas, so keeping MDAM 
in mind is definitely important.


However, I think the key part to progress on the "core" functionality 
(B-tree related changes) is to get the planner functionality in place 
first. Hopefully we can make progress on that during the November 
CommitFest based on Andy's patch.


Best regards,
 Jesper





Re: MDAM techniques and Index Skip Scan patch

2021-10-23 Thread Dmitry Dolgov
> On Thu, Oct 21, 2021 at 07:16:00PM -0700, Peter Geoghegan wrote:
>
> My general concern is that the skip scan patch may currently be
> structured in a way that paints us into a corner, MDAM-wise.
>
> Note that the MDAM paper treats skipping a prefix of columns as a case
> where the prefix is handled by pretending that there is a clause that
> looks like this: "WHERE date between -inf AND +inf" -- which is not so
> different from the original sales SQL query example that I have
> highlighted. We don't tend to think of queries like this (like my
> sales query) as in any way related to skip-scan, because we don't
> imagine that there is any skipping going on. But maybe we should
> recognize the similarities.

To avoid potential problems with extensibility in this sense, the
implementation needs to explicitly work with sets of disjoint intervals
of values instead of simple prefix size, one set of intervals per scan
key. An interesting idea, doesn't seem to be a big change in terms of
the patch itself.




MDAM techniques and Index Skip Scan patch

2021-10-21 Thread Peter Geoghegan
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.

I don't necessarily expect to come away from this discussion with a
much better high level architecture for the patch, or any kind of
deeper insight, or even a frame of reference for further discussion. I
just think that we ought to *try* to impose some order on this stuff.

Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things. Many other MDAM techniques also seem highly appealing.
Much of the MDAM stuff is for data warehousing use-cases, while skip
scan/loose index scan is seen as more of an OLTP thing. But they are
still related, clearly.

I'd like to also talk about another patch, that ISTM had that same
quality -- it was also held back by high level design uncertainty. Back in 2018,
Tom abandoned a patch that transformed a star-schema style query with
left outer joins on dimension tables with OR conditions, into an
equivalent query that UNIONs together 2 distinct queries [3][4].

Believe it or not, I am now reminded of that patch by the example of
"IN() Lists", from page 5 of the paper. We see this example SQL query:

SELECT date, item_class, store, sum(total_sales)
FROM sales
WHERE date between '06/01/95' and '06/30/95' and
item_class IN (20,35,50) and
store IN (200,250)
GROUP BY dept, date, item_class, store;

Granted, this SQL might not seem directly relevant to Tom's patch at
first -- there is no join for the optimizer to even try to eliminate,
which was the whole basis of Jim Nasby's original complaint, which is
what spurred Tom to write the patch in the first place. But hear me
out: there is still a fact table (the sales table) with some
dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the
table (and this SQL query) drives discussion of an optimization
involving transforming a predicate with many ORs (which is explicitly
said to be logically/semantically equivalent to the IN() lists from
the query). They transform the query into a bunch of disjunct clauses
that can easily be independently executed, and combined at the end
(see also "General OR Optimization" on page 6 of the paper).

Also...I'm not entirely sure that the intended underlying "physical
plan" is truly free of join-like scans. If you squint just right, you
might see something that you could think of as a "physical join" (at
least very informally). The whole point of this particular "IN()
Lists" example is that we get to the following, for each distinct
"dept" and "date" in the table:

dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250

There are 2400 such accesses in total after transformation -- imagine
additional lines like these, for every distinct combination of dept
and date (only for those dates that actually had sales, which they
enumerate up-front), for store 200 and 250, and item_class 20, 35, and
50. This adds up to 2400 lines in total. Even 2400 index probes will
be much faster than a full table scan, given that this is a large fact
table. The "sales" table is a clustered index whose keys are on the
columns "(dept, date, item_class, store)", per note at the top of page
4. The whole point is to avoid having any secondary indexes on this
fact table, without getting a full scan. We can just probe the primary
key 2400 times instead, following this transformation. No need for
secondary indexes.

The plan can be thought of as a DAG, at least informally. This is also
somewhat similar to what Tom was thinking about back in 2018. Tom had
to deduplicate rows during execution (IIRC using a UNION style ad-hoc
approach that sorted on TIDs), whereas I think that they can get away
with skipping that extra step. Page 7 says "MDAM removes duplicates
before reading the data, so it does not have to do any post read
operations to accomplish duplicate elimination (a common problem with
OR optimization)".

My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.

Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a 

Re: Index Skip Scan (new UniqueKeys)

2021-05-21 Thread Dmitry Dolgov
Hi,

Here is another take on the patch with a couple of changes:

* I've removed for now UniqueKeys parts. The interaction of skip scan &
  unique keys patch was actually not that big, so the main difference is
  that now the structure itself went away, a list of unique expressions
  is used instead. All the suggestions about how this feature should
  look like from the planning perspective are still there. On the one
  hand it will allow to develop both patches independently and avoid
  confusion for reviewers, on the other UniqueKeys could be easily
  incorporated back when needed.

* Support for skipping in case of moving backward on demand (scroll
  cursor) is moved into a separate patch. This is implemented via
  returning false from IndexSupportsBackwardScan in case if it's a skip
  scan node, which in turn adds Materialize node on top when needed. The
  name SupportsBackwardScan was a bit confusing for me, but it seems
  it's only being used with a cursorOptions check for CURSOR_OPT_SCROLL.
  Eventually those cases when BackwardScanDirection is used are still
  handled by amskip. This change didn't affect the test coverage, all
  the test cases supported in previous patch versions are still there.

  About Materialize node, I guess it could be less performant than a
  "native" support, but it simplifies the implementation significantly
  to the point that most parts, which were causing questions before, are
  now located in the isolated patch. My idea here is to concentrate
  efforts on the first three patches in this series, and consider the
  rest of them as an experiment field.

* IndexScan support was also relocated into a separate patch, the first
  three patches are now only about IndexOnlyScan.

* Last bits of reviews were incorporated and rebased.
>From 074fc8a41b43dd8b10ab29eb880c9d161a1638d5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v39 1/5] Unique expressions

Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query
level, and allows to leverage this into on the path level. At the moment
only distinctClause makes use of such mechanism, which enables potential
use of index skip scan.

Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan, contains few bits out of previous version
from Jesper Pedersen, Floris Van Nee.
---
 src/backend/nodes/list.c| 31 +
 src/backend/optimizer/path/Makefile |  3 +-
 src/backend/optimizer/path/pathkeys.c   | 62 +
 src/backend/optimizer/path/uniquekeys.c | 92 +
 src/backend/optimizer/plan/planner.c| 36 +-
 src/backend/optimizer/util/pathnode.c   | 32 ++---
 src/include/nodes/pathnodes.h   |  5 ++
 src/include/nodes/pg_list.h |  1 +
 src/include/optimizer/pathnode.h|  1 +
 src/include/optimizer/paths.h   |  9 +++
 10 files changed, 261 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 94fb236daf..9f2c408a4e 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1537,3 +1537,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
 		return 1;
 	return 0;
 }
+
+/*
+ * Return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell	*lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool found = false;
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+found = true;
+break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..f28547148d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"
 
 
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 			 RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }
 
+/*
+

Re: Index Skip Scan (new UniqueKeys)

2021-03-17 Thread Tomas Vondra



On 3/17/21 6:02 PM, Dmitry Dolgov wrote:
>> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote:
>> Hi,
>>
>> I took a look at the new patch series, focusing mostly on the uniquekeys
>> part. It'd be a bit tedious to explain all the review comments here, so
>> attached is a patch series with a "review" patch for some of the parts.
> 
> Great, thanks.
> 
>> Most of it is fairly small (corrections to comments etc.), I'll go over
>> the more serious part so that we can discuss it here. I'll keep it split
>> per parts of the original patch series.
>> I suggest looking for XXX and FIXME comments in all the review patches.
>>
>>
>> 0001
>> 
>>
>> 
>>
>> 0002
>> 
>>
> 
> In fact both 0001 & 0002 belong to another thread, which these days
> span [1], [2]. I've included them only because they happened to be a
> dependency for index skip scan following David suggestions, sorry if
> it's confusing.
> 
> At the same time the author behind 0001 & 0002 is present in this thread
> as well, maybe Andy can answer these comments right here and better than me.
> 

Ah, sorry for the confusion. In that case the review comments probably
belong to the other threads, so we should move the discussion there.
It's not clear to me which of the threads is the right one.

>> 0003
>> 
>>
>> Just some comments/whitespace.
>>
>>
>> 0004
>> 
>>
>> I wonder why we don't include this in explain TEXT format? Seems it
>> might make it harder to write regression tests for this? It's easier to
>> just check that we deduced the right unique key(s) than having to
>> construct an example where it actually changes the plan.
> 
> Yeah, good point. I believe originally it was like that to not make
> explain too verbose for skip scans, but displaying prefix definitely
> could be helpful for testing, so will do this (and address other
> comments as well).
> 

Cool. Thanks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Index Skip Scan (new UniqueKeys)

2021-03-17 Thread Dmitry Dolgov
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote:
> Hi,
>
> I took a look at the new patch series, focusing mostly on the uniquekeys
> part. It'd be a bit tedious to explain all the review comments here, so
> attached is a patch series with a "review" patch for some of the parts.

Great, thanks.

> Most of it is fairly small (corrections to comments etc.), I'll go over
> the more serious part so that we can discuss it here. I'll keep it split
> per parts of the original patch series.
> I suggest looking for XXX and FIXME comments in all the review patches.
>
>
> 0001
> 
>
> 
>
> 0002
> 
>

In fact both 0001 & 0002 belong to another thread, which these days
span [1], [2]. I've included them only because they happened to be a
dependency for index skip scan following David suggestions, sorry if
it's confusing.

At the same time the author behind 0001 & 0002 is present in this thread
as well, maybe Andy can answer these comments right here and better than me.

> 0003
> 
>
> Just some comments/whitespace.
>
>
> 0004
> 
>
> I wonder why we don't include this in explain TEXT format? Seems it
> might make it harder to write regression tests for this? It's easier to
> just check that we deduced the right unique key(s) than having to
> construct an example where it actually changes the plan.

Yeah, good point. I believe originally it was like that to not make
explain too verbose for skip scans, but displaying prefix definitely
could be helpful for testing, so will do this (and address other
comments as well).

[1]: 
https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com




Re: Index Skip Scan (new UniqueKeys)

2021-01-28 Thread Dmitry Dolgov
> On Thu, Jan 28, 2021 at 09:49:26PM +0900, Masahiko Sawada wrote:
> Hi Dmitry,
>
> Status update for a commitfest entry.
>
> This patch entry has been "Waiting on Author" on CF app and the
> discussion seems inactive from the last CF. Could you share the
> current status of this patch? Heikki already sent review comments and
> there was a discussion but the WoA status is correct? If it needs
> reviews, please rebase the patches and set it to "Needs Reviews" on CF
> app. If you're not working on this, I'm going to set it to "Returned
> with Feedback", barring objections.

Yes, I'm still on it. In fact, I've sketched up almost immediately
couple of changes to address Heikki feedback, but was distracted by
subscripting stuff. Will try to send new version of the patch soon.




Re: Index Skip Scan (new UniqueKeys)

2021-01-28 Thread Masahiko Sawada
Hi Dmitry,

On Sun, Oct 25, 2020 at 1:45 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Oct 06, 2020 at 05:20:39PM +0200, Dmitry Dolgov wrote:
> > > On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
> > >
> > > * I see the following compiler warning:
> > >
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> > > In function ‘populate_baserel_uniquekeys’:
> > > /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> > > warning: ‘expr’ may be used uninitialized in this function
> > > [-Wmaybe-uninitialized]
> > >   797 |   else if (!list_member(unique_index->rel->reltarget->exprs, 
> > > expr))
> > >   | ^~
> >
> > This is mostly for UniqueKeys patch, which is attached here only as a
> > dependency, but I'll prepare changes for that. Interesting enough I
> > can't reproduce this warning, but if I understand correctly gcc has some
> > history of spurious uninitialized warnings, so I guess it could be
> > version dependent.
> >
> > > * Perhaps the warning is related to this nearby code that I noticed
> > > Valgrind complains about:
> > >
> > > ==1083468== VALGRINDERROR-BEGIN
> > > ==1083468== Invalid read of size 4
> > > ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> > > ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
> >
> > This also belongs to UniqueKeys patch, but at least I can reproduce this
> > one. My guess is that nkeycolums should be used there, not ncolums,
> > which is visible in index_incuding tests. The same as previous one, will
> > prepare corresponding changes.
> >
> > > * Do we really need the AM-level boolean flag/argument named
> > > "scanstart"? Why not just follow the example of btgettuple(), which
> > > determines whether or not the scan has been initialized based on the
> > > current scan position?
> > >
> > > Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> > > cannot or should not take the same approach as btgettuple(). And even
> > > if you can't take exactly the same approach, I would still think that
> > > the scan's opaque B-Tree state should remember if it's the first call
> > > to _bt_skip() (rather than some subsequent call) in some other way
> > > (e.g. carrying a "scanstart" bool flag directly).
> >
> > Yes, agree, carrying this flag inside the opaque state would be better.
>
> Here is a new version which doesn't require "scanstart" argument and
> contains few other changes to address the issues mentioned earlier. It's
> also based on the latest UniqueKeys patches with the valgrind issue
> fixed (as before they're attached also just for the references, you can
> find more in the original thread). I didn't rework commentaries yet,
> will post it soon (need to get an inspiration first, probably via
> reading Shakespeare unless someone has better suggestions).
>
> > > * Why is it okay to do anything important based on the
> > > _bt_scankey_within_page() return value?
> > >
> > > If the page is empty, then how can we know that it's okay to go to the
> > > next value? I'm concerned that there could be subtle bugs in this
> > > area. VACUUM will usually just delete the empty page. But it won't
> > > always do so, for a variety of reasons that aren't worth going into
> > > now. This could mask bugs in this area. I'm concerned about patterns
> > > like this one from _bt_skip():
> > >
> > > while (!nextFound)
> > > {
> > > 
> > >
> > > if (_bt_scankey_within_page(scan, so->skipScanKey,
> > > so->currPos.buf, dir))
> > > {
> > > ...
> > > }
> > > else
> > > /*
> > >  * If startItup could be not found within the current 
> > > page,
> > >  * assume we found something new
> > >  */
> > > nextFound = true;
> > > 
> > > }
> > >
> > > Why would you assume that "we found something new" here? In general I
> > > just don't understand the design of _bt_skip(). I get the basic idea
> > > of what you're trying to do, but it could really use better comments.
> >
> > Yeah, I'll put more efforts into clear comments. There are two different
> > ways in which _bt_scankey_within_page is being used.
> >
> > The first one is to check if it's possible to skip traversal of the tree
> > from root in case if what we're looking for could be on the current
> > page. In this case an empty page would mean we need to search from the
> > root, so not sure what could be the issue here?
> >
> > The second one (that you've highlighted above) I admit is probably the
> > most questionable part of the patch and open for suggestions how to
> > improve it. It's required for one 

Re: Index Skip Scan (new UniqueKeys)

2020-12-05 Thread Thomas Munro
On Wed, Dec 2, 2020 at 9:59 AM Heikki Linnakangas  wrote:
> On 01/12/2020 22:21, Dmitry Dolgov wrote:
> >> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
> >> - I'm surprised you need a new index AM function (amskip) for this. Can't
> >> you just restart the scan with index_rescan()? The btree AM can check if 
> >> the
> >> new keys are on the same page, and optimize the rescan accordingly, like
> >> amskip does. That would speed up e.g. nested loop scans too, where the keys
> >> just happen to be clustered.
> >
> > An interesting point. At the moment I'm not sure whether it's possible
> > to implement skipping via index_rescan or not, need to take a look. But
> > checking if the new keys are on the same page would introduce some
> > overhead I guess, wouldn't it be too invasive to add it into already
> > existing btree AM?
>
> I think it'll be OK. But if it's not, you could add a hint argument to
> index_rescan() to hint the index AM that the new key is known to be
> greater than the previous key.

FWIW here's what I wrote about that years ago[1]:

> It works by adding a new index operation 'skip' which the executor
> code can use during a scan to advance to the next value (for some
> prefix of the index's columns). That may be a terrible idea and
> totally unnecessary... but let me explain my
> reasoning:
>
> 1. Perhaps some index implementations can do something better than a
> search for the next key value from the root. Is it possible or
> desirable to use the current position as a starting point for a btree
> traversal? I don't know.
>
> 2. It seemed that I'd need to create a new search ScanKey to use the
> 'rescan' interface for skipping to the next value, but I already had
> an insertion ScanKey so I wanted a way to just reuse that. But maybe
> there is some other way to reuse existing index interfaces, or maybe
> there is an easy way to make a new search ScanKey from the existing
>insertion ScanKey?

[1] 
https://www.postgresql.org/message-id/CADLWmXWALK8NPZqdnRQiPnrzAnic7NxYKynrkzO_vxYr8enWww%40mail.gmail.com




Re: Index Skip Scan (new UniqueKeys)

2020-12-05 Thread Dmitry Dolgov
> On Tue, Dec 01, 2020 at 10:59:22PM +0200, Heikki Linnakangas wrote:
>
> > > - Does this optimization apply to bitmap index scans?
> >
> > No, from what I understand it doesn't.
>
> Would it be hard to add? Don't need to solve everything in the first
> version of this, but I think in principle you could do the same
> optimization for bitmap index scans, so if the current API can't do it,
> that's maybe an indication that the API isn't quite right.

I would expect it should not be hard as at the moment all parts seems
relatively generic. But of course I need to check, while it seems no one
had bitmap index scans in mind while developing this patch.

> > > I'm actually a bit confused why we need this condition. The IndexScan
> > > executor node should call amskip() only after checking the additional 
> > > quals,
> > > no?
> >
> > This part I don't quite get, what exactly you mean by checking the
> > additional quals in the executor node? But at the end of the day this
> > condition was implemented exactly to address the described issue, which
> > was found later and added to the tests.
>
> As I understand this, the executor logic goes like this:
>
> query: SELECT DISTINCT ON (a, b)  a, b FROM foo where c like '%y%' and a
> like 'a%' and b = 'b';
>
> 1. Call index_beginscan, keys: a >= 'a', b = 'b'
>
> 2. Call index_getnext, which returns first row to the Index Scan node
>
> 3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto step
> 2 to get next tuple.
>
> 4. Return tuple to parent node
>
> 5. index_amskip(), to the next tuple with a > 'a'. Goto 2.
>
> The logic should work fine, even if there are quals that are not indexable,
> like "c like '%y'" in the above example. So why doesn't it work? What am I
> missing?

To remind myself how it works I went through this sequence, and from
what I understand the qual "c like '%y%'" is evaluated in this case in
ExecQual, not after index_getnext_tid (and values returned after
skipping are reported as filtered out). So when it comes to index_skip
only quals on a & b were evaluated. Or did you mean something else?

Another small detail is that in the current implementation there is no
goto 2 in the last step. Originally it was like that, but since skipping
return an exact position that we need there was something like "find a
value, then do one step back so that index_getnext will find it".
Unfortunately this stepping back part turns out to be a source of
troubles, and getting rid of it even allowed to make code somewhat more
concise. But of course I'm open for suggestions about improvements.




Re: Index Skip Scan (new UniqueKeys)

2020-12-01 Thread Heikki Linnakangas

On 01/12/2020 22:21, Dmitry Dolgov wrote:

On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:

I had a quick look at this patch. I haven't been following this thread, so
sorry if I'm repeating old arguments, but here we go:


Thanks!


- I'm surprised you need a new index AM function (amskip) for this. Can't
you just restart the scan with index_rescan()? The btree AM can check if the
new keys are on the same page, and optimize the rescan accordingly, like
amskip does. That would speed up e.g. nested loop scans too, where the keys
just happen to be clustered.


An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?


I think it'll be OK. But if it's not, you could add a hint argument to 
index_rescan() to hint the index AM that the new key is known to be 
greater than the previous key.



- Does this optimization apply to bitmap index scans?


No, from what I understand it doesn't.


Would it be hard to add? Don't need to solve everything in the first 
version of this, but I think in principle you could do the same 
optimization for bitmap index scans, so if the current API can't do it, 
that's maybe an indication that the API isn't quite right.



- This logic in build_index_paths() is not correct:


+   /*
+* Skip scan is not supported when there are qual conditions, 
which are not
+* covered by index. The reason for that is that those 
conditions are
+* evaluated later, already after skipping was applied.
+*
+* TODO: This implementation is too restrictive, and doesn't 
allow e.g.
+* index expressions. For that we need to examine index_clauses 
too.
+*/
+   if (root->parse->jointree != NULL)
+   {
+   ListCell *lc;
+
+   foreach(lc, (List *)root->parse->jointree->quals)
+   {
+   Node *expr, *qual = (Node *) lfirst(lc);
+   Var *var;
+   bool found = false;
+
+   if (!is_opclause(qual))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   expr = get_leftop(qual);
+
+   if (!IsA(expr, Var))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   var = (Var *) expr;
+
+   for (int i = 0; i < index->ncolumns; i++)
+   {
+   if (index->indexkeys[i] == 
var->varattno)
+   {
+   found = true;
+   break;
+   }
+   }
+
+   if (!found)
+   {
+   not_empty_qual = true;
+   break;
+   }
+   }
+   }


If you care whether the qual is evaluated by the index AM or not, you need
to also check that the operator is indexable. Attached is a query that
demonstrates that problem.
...
Also, you should probably check that the index quals are in the operator
family as that used for the DISTINCT.


Yes, good point, will change this in the next version.


I'm actually a bit confused why we need this condition. The IndexScan
executor node should call amskip() only after checking the additional quals,
no?


This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.


As I understand this, the executor logic goes like this:

query: SELECT DISTINCT ON (a, b)  a, b FROM foo where c like '%y%' and a 
like 'a%' and b = 'b';


1. Call index_beginscan, keys: a >= 'a', b = 'b'

2. Call index_getnext, which returns first row to the Index Scan node

3. Evaluates the qual "c like '%y%'" on the tuple. If it's false, goto 
step 2 to get next tuple.


4. Return tuple to parent node

5. index_amskip(), to the next tuple with a > 'a'. Goto 2.

The logic should work fine, even if there are quals that are not 
indexable, like "c like '%y'" in the above example. So why doesn't it 
work? What am I 

Re: Index Skip Scan (new UniqueKeys)

2020-12-01 Thread Dmitry Dolgov
> On Mon, Nov 30, 2020 at 04:42:20PM +0200, Heikki Linnakangas wrote:
>
> I had a quick look at this patch. I haven't been following this thread, so
> sorry if I'm repeating old arguments, but here we go:

Thanks!

> - I'm surprised you need a new index AM function (amskip) for this. Can't
> you just restart the scan with index_rescan()? The btree AM can check if the
> new keys are on the same page, and optimize the rescan accordingly, like
> amskip does. That would speed up e.g. nested loop scans too, where the keys
> just happen to be clustered.

An interesting point. At the moment I'm not sure whether it's possible
to implement skipping via index_rescan or not, need to take a look. But
checking if the new keys are on the same page would introduce some
overhead I guess, wouldn't it be too invasive to add it into already
existing btree AM?

> - Does this optimization apply to bitmap index scans?

No, from what I understand it doesn't.

> - This logic in build_index_paths() is not correct:
>
> > +   /*
> > +* Skip scan is not supported when there are qual conditions, 
> > which are not
> > +* covered by index. The reason for that is that those 
> > conditions are
> > +* evaluated later, already after skipping was applied.
> > +*
> > +* TODO: This implementation is too restrictive, and doesn't 
> > allow e.g.
> > +* index expressions. For that we need to examine index_clauses 
> > too.
> > +*/
> > +   if (root->parse->jointree != NULL)
> > +   {
> > +   ListCell *lc;
> > +
> > +   foreach(lc, (List *)root->parse->jointree->quals)
> > +   {
> > +   Node *expr, *qual = (Node *) lfirst(lc);
> > +   Var *var;
> > +   bool found = false;
> > +
> > +   if (!is_opclause(qual))
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +
> > +   expr = get_leftop(qual);
> > +
> > +   if (!IsA(expr, Var))
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +
> > +   var = (Var *) expr;
> > +
> > +   for (int i = 0; i < index->ncolumns; i++)
> > +   {
> > +   if (index->indexkeys[i] == 
> > var->varattno)
> > +   {
> > +   found = true;
> > +   break;
> > +   }
> > +   }
> > +
> > +   if (!found)
> > +   {
> > +   not_empty_qual = true;
> > +   break;
> > +   }
> > +   }
> > +   }
>
> If you care whether the qual is evaluated by the index AM or not, you need
> to also check that the operator is indexable. Attached is a query that
> demonstrates that problem.
> ...
> Also, you should probably check that the index quals are in the operator
> family as that used for the DISTINCT.

Yes, good point, will change this in the next version.

> I'm actually a bit confused why we need this condition. The IndexScan
> executor node should call amskip() only after checking the additional quals,
> no?

This part I don't quite get, what exactly you mean by checking the
additional quals in the executor node? But at the end of the day this
condition was implemented exactly to address the described issue, which
was found later and added to the tests.




Re: Index Skip Scan (new UniqueKeys)

2020-11-30 Thread Heikki Linnakangas

On 24/10/2020 19:45, Dmitry Dolgov wrote:

Here is a new version which doesn't require "scanstart" argument and
contains few other changes to address the issues mentioned earlier. It's
also based on the latest UniqueKeys patches with the valgrind issue
fixed (as before they're attached also just for the references, you can
find more in the original thread). I didn't rework commentaries yet,
will post it soon (need to get an inspiration first, probably via
reading Shakespeare unless someone has better suggestions).


I had a quick look at this patch. I haven't been following this thread, 
so sorry if I'm repeating old arguments, but here we go:


- I'm surprised you need a new index AM function (amskip) for this. 
Can't you just restart the scan with index_rescan()? The btree AM can 
check if the new keys are on the same page, and optimize the rescan 
accordingly, like amskip does. That would speed up e.g. nested loop 
scans too, where the keys just happen to be clustered.


- Does this optimization apply to bitmap index scans?

- This logic in build_index_paths() is not correct:


+   /*
+* Skip scan is not supported when there are qual conditions, 
which are not
+* covered by index. The reason for that is that those 
conditions are
+* evaluated later, already after skipping was applied.
+*
+* TODO: This implementation is too restrictive, and doesn't 
allow e.g.
+* index expressions. For that we need to examine index_clauses 
too.
+*/
+   if (root->parse->jointree != NULL)
+   {
+   ListCell *lc;
+
+   foreach(lc, (List *)root->parse->jointree->quals)
+   {
+   Node *expr, *qual = (Node *) lfirst(lc);
+   Var *var;
+   bool found = false;
+
+   if (!is_opclause(qual))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   expr = get_leftop(qual);
+
+   if (!IsA(expr, Var))
+   {
+   not_empty_qual = true;
+   break;
+   }
+
+   var = (Var *) expr;
+
+   for (int i = 0; i < index->ncolumns; i++)
+   {
+   if (index->indexkeys[i] == 
var->varattno)
+   {
+   found = true;
+   break;
+   }
+   }
+
+   if (!found)
+   {
+   not_empty_qual = true;
+   break;
+   }
+   }
+   }


If you care whether the qual is evaluated by the index AM or not, you 
need to also check that the operator is indexable. Attached is a query 
that demonstrates that problem.


I'm actually a bit confused why we need this condition. The IndexScan 
executor node should call amskip() only after checking the additional 
quals, no?


Also, you should probably check that the index quals are in the operator 
family as that used for the DISTINCT.


- Heikki


index-skipscan-bug.sql
Description: application/sql


Re: Index Skip Scan (new UniqueKeys)

2020-10-06 Thread Dmitry Dolgov
> On Mon, Sep 21, 2020 at 05:59:32PM -0700, Peter Geoghegan wrote:
>
> * I see the following compiler warning:
>
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
> In function ‘populate_baserel_uniquekeys’:
> /code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
> warning: ‘expr’ may be used uninitialized in this function
> [-Wmaybe-uninitialized]
>   797 |   else if (!list_member(unique_index->rel->reltarget->exprs, expr))
>   | ^~

This is mostly for UniqueKeys patch, which is attached here only as a
dependency, but I'll prepare changes for that. Interesting enough I
can't reproduce this warning, but if I understand correctly gcc has some
history of spurious uninitialized warnings, so I guess it could be
version dependent.

> * Perhaps the warning is related to this nearby code that I noticed
> Valgrind complains about:
>
> ==1083468== VALGRINDERROR-BEGIN
> ==1083468== Invalid read of size 4
> ==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
> ==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)

This also belongs to UniqueKeys patch, but at least I can reproduce this
one. My guess is that nkeycolums should be used there, not ncolums,
which is visible in index_incuding tests. The same as previous one, will
prepare corresponding changes.

> * Do we really need the AM-level boolean flag/argument named
> "scanstart"? Why not just follow the example of btgettuple(), which
> determines whether or not the scan has been initialized based on the
> current scan position?
>
> Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
> cannot or should not take the same approach as btgettuple(). And even
> if you can't take exactly the same approach, I would still think that
> the scan's opaque B-Tree state should remember if it's the first call
> to _bt_skip() (rather than some subsequent call) in some other way
> (e.g. carrying a "scanstart" bool flag directly).

Yes, agree, carrying this flag inside the opaque state would be better.

> * Why is it okay to do anything important based on the
> _bt_scankey_within_page() return value?
>
> If the page is empty, then how can we know that it's okay to go to the
> next value? I'm concerned that there could be subtle bugs in this
> area. VACUUM will usually just delete the empty page. But it won't
> always do so, for a variety of reasons that aren't worth going into
> now. This could mask bugs in this area. I'm concerned about patterns
> like this one from _bt_skip():
>
> while (!nextFound)
> {
> 
>
> if (_bt_scankey_within_page(scan, so->skipScanKey,
> so->currPos.buf, dir))
> {
> ...
> }
> else
> /*
>  * If startItup could be not found within the current 
> page,
>  * assume we found something new
>  */
> nextFound = true;
> 
> }
>
> Why would you assume that "we found something new" here? In general I
> just don't understand the design of _bt_skip(). I get the basic idea
> of what you're trying to do, but it could really use better comments.

Yeah, I'll put more efforts into clear comments. There are two different
ways in which _bt_scankey_within_page is being used.

The first one is to check if it's possible to skip traversal of the tree
from root in case if what we're looking for could be on the current
page. In this case an empty page would mean we need to search from the
root, so not sure what could be the issue here?

The second one (that you've highlighted above) I admit is probably the
most questionable part of the patch and open for suggestions how to
improve it. It's required for one particular case with a cursor when
scan advances forward but reads backward. What could happen here is we
found one valid item, but the next one e.g. do not pass scan key
conditions, and we end up with the previous item again. I'm not entirely
sure how presence of an empty page could change this scenario, could you
please show an example?

> *The "jump one more time if it's the same as at the beginning" thing
> seems scary to me. Maybe you should be doing something with the actual
> high key here.

Same as for the previous question, can you give a hint what do you mean
by "doing something with the actual high key"?




Re: Index Skip Scan (new UniqueKeys)

2020-09-21 Thread Peter Geoghegan
On Mon, Sep 21, 2020 at 5:59 PM Peter Geoghegan  wrote:
> That's all I have for now.

One more thing. I don't think that this should be a bitwise AND:

if ((offnum > maxoff) & (so->currPos.nextPage == P_NONE))
{

}


-- 
Peter Geoghegan




Re: Index Skip Scan (new UniqueKeys)

2020-09-21 Thread Peter Geoghegan
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Here is a new version that hopefully address most of the concerns
> mentioned in this thread so far. As before, first two patches are taken
> from UniqueKeys thread and attached only for the reference. List of
> changes includes:

Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):

* I see the following compiler warning:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |   else if (!list_member(unique_index->rel->reltarget->exprs, expr))
  | ^~

* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:

==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468==at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468==by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468==by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468==by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468==by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468==by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468==by 0x5AB426: query_planner (planmain.c:269)
==1083468==by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468==by 0x5AD202: subquery_planner (planner.c:1015)
==1083468==by 0x5ABABF: standard_planner (planner.c:405)
==1083468==by 0x5AB7F8: planner (planner.c:275)
==1083468==by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468==by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468==by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==by 0x624284: BackendRun (postmaster.c:4541)
==1083468==by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==by 0x514AF2: main (main.c:209)
==1083468==  Address 0x75f13e0 is 4,448 bytes inside a block of size
8,192 alloc'd
==1083468==at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468==by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468==by 0x8CEA52: palloc (mcxt.c:964)
==1083468==by 0x267F25: systable_beginscan (genam.c:373)
==1083468==by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468==by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468==by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468==by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468==by 0x8BD482: check_enable_rls (rls.c:66)
==1083468==by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468==by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468==by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468==by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468==by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468==by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468==by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468==by 0x624284: BackendRun (postmaster.c:4541)
==1083468==by 0x623995: BackendStartup (postmaster.c:4225)
==1083468==by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468==by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END

(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)

* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only
scans.

* Do we really need the AM-level boolean flag/argument named
"scanstart"? Why not just follow the example of btgettuple(), which
determines whether or not the scan has been initialized based on the
current scan position?

Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even
if you can't take exactly the same approach, I would still think that
the scan's opaque B-Tree state should remember if it's the first call
to _bt_skip() (rather than some subsequent call) in some other way
(e.g. carrying a "scanstart" bool flag directly).

A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of
indentation...which makes it even more difficult to follow.

* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data 

Re: Index Skip Scan (new UniqueKeys)

2020-08-02 Thread Andy Fan
Hi David:

Thanks for looking into this.

On Fri, Jul 31, 2020 at 11:07 AM David Rowley  wrote:

> On Mon, 13 Jul 2020 at 10:18, Floris Van Nee 
> wrote:
> > One question about the unique keys - probably for Andy or David: I've
> looked in the archives to find arguments for/against using Expr nodes or
> EquivalenceClasses in the Unique Keys patch. However, I couldn't really
> find a clear answer about why the current patch uses Expr rather than
> EquivalenceClasses. At some point David mentioned "that probably Expr nodes
> were needed rather than EquivalenceClasses", but it's not really clear to
> me why. What were the thoughts behind this?
>
> I'm still not quite sure on this either way.  I did think
> EquivalenceClasses were more suitable before I wrote the POC patch for
> unique keys.  But after that, I had in mind that Exprs might be
> better. The reason I thought this was due to the fact that the
> DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
> EquivalenceClasses then checking to see if the DISTINCT can be skipped
> turned into something more complex that required looking through lists
> of ec_members rather than just checking if the uniquekey exprs were a
> subset of the DISTINCT clause.
>

> Thinking about it a bit harder, if we did use Exprs then it would mean
> it a case like the following wouldn't work for Andy's DISTINCT no-op
> stuff.
>
> CREATE TABLE xy (x int primary key, y int not null);
>
> SELECT DISTINCT y FROM xy WHERE x=y;
>
> whereas if we use EquivalenceClasses then we'll find that we have an
> EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
> containing that EquivalenceClass.


> Also, looking at what Andy wrote to make a case like the following
> work in his populate_baserel_uniquekeys() function in the 0002 patch:
>
> CREATE TABLE ab (a int, b int, primary key(a,b));
> SELECT DISTINCT a FROM ab WHERE b = 1;
>
> it's a bit uninspiring. Really what we want here when checking if we
> can skip doing the DISTINCT is a UniqueKey set using
> EquivalenceClasses as we can just insist that any unmatched UniqueKey
> items have an ec_is_const == true. However, that means we have to loop
> through the ec_members of the EquivalenceClasses in the uniquekeys
> during the DISTINCT check.  That's particularly bad when you consider
> that in a partitioned table case there might be an ec_member for each
> child partition and there could be 1000s of child partitions and
> following those ec_members chains is going to be too slow.
>
> My current thoughts are that we should be using EquivalenceClasses but
> we should first add some infrastructure to make them perform better.
> My current thoughts are that we do something like what I mentioned in
> [1] or something more like what Andres mentions in [2].  After that,
> we could either make EquivalenceClass.ec_members a hash table or
> binary search tree. Or even perhaps just have a single hash table/BST
> for all EquivalenceClasses that allows very fast lookups from {Expr}
> -> {EquivalenceClass}.  I think an Expr can only belong in a single
> non-merged EquivalenceClass. So when we do merging of
> EquivalenceClasses we could just repoint that data structure to point
> to the new EquivalenceClass. We'd never point to ones that have
> ec_merged != NULL.  This would also allow us to fix the poor
> performance in regards to get_eclass_for_sort_expr() for partitioned
> tables.
>
> So, it seems the patch dependency chain for skip scans just got a bit
> longer :-(
>
>
I admit that  EquivalenceClasses has a better expressive power.   There are
2 more
cases we can handle better with EquivalenceClasses.  SELECT DISTINCT a, b,
c
FROM t WHERE a = b;  Currently the UniqueKey is (a, b, c), but it is better
be (a, c)
and (b, c).   The other case happens similarly in group by case.

After realizing this,  I am still hesitant to do that, due to the
complexity.  If we do that,
we may have to maintain a EquivalenceClasses in one more place or make the
existing
EquivalenceClasses List longer, for example:  SELECT pk FROM t;  The
current
infrastructure doesn't create any EquivalenceClasses for pk.  So we have to
create
a new one in this case and reuse some existing ones in other cases.
Finally since the
EquivalenceClasses is not so straight to upper user, we have to depend on
the
infrastructure change to look up an EquivalenceClasses quickly from an
Expr.

I rethink more about the case you provide above,  IIUC, there is such issue
for joinrel.
then we can just add a EC checking for populate_baserel_uniquekeys.  As for
the
DISTINCT/GROUP BY case,  we should build the UniqueKeys from
root->distinct_pathkeys
and root->group_pathkeys where the  EquivalenceClasses are already there.

I am still not insisting on either Expr or EquivalenceClasses right now,
if we need to
change it to EquivalenceClasses,  I'd see if we need to have more places to
take
care before doing that.

-- 
Best Regards
Andy Fan


Re: Index Skip Scan (new UniqueKeys)

2020-07-30 Thread David Rowley
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee  wrote:
> One question about the unique keys - probably for Andy or David: I've looked 
> in the archives to find arguments for/against using Expr nodes or 
> EquivalenceClasses in the Unique Keys patch. However, I couldn't really find 
> a clear answer about why the current patch uses Expr rather than 
> EquivalenceClasses. At some point David mentioned "that probably Expr nodes 
> were needed rather than EquivalenceClasses", but it's not really clear to me 
> why. What were the thoughts behind this?

I'm still not quite sure on this either way.  I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys.  But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.

Thinking about it a bit harder, if we did use Exprs then it would mean
it a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.

CREATE TABLE xy (x int primary key, y int not null);

SELECT DISTINCT y FROM xy WHERE x=y;

whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.

Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:

CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;

it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check.  That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.

My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2].  After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}.  I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL.  This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.

So, it seems the patch dependency chain for skip scans just got a bit longer :-(

David

[1] 
https://www.postgresql.org/message-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/20190920051857.2fhnvhvx4qdddviz%40alap3.anarazel.de#c3add3919c534591eae2179a6c82742c




Re: Index Skip Scan (new UniqueKeys)

2020-07-27 Thread Dmitry Dolgov
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
>
> > Well, it's obviously wrong, thanks for noticing. What is necessary is to
> > compare two index tuples, the start and the next one, to test if they're
> > the same (in which case if I'm not mistaken probably we can compare item
> > pointers). I've got this question when I was about to post a new version
> > with changes to address feedback from Andy, now I'll combine them and
> > send a cumulative patch.
>
> This sounds like approximately the same problem as the one that
> _bt_killitems() has to deal with as of Postgres 13. This is handled in
> a way that is admittedly pretty tricky, even though the code does not
> need to be 100% certain that it's "the same" tuple. Deduplication kind
> of makes that a fuzzy concept. In principle there could be one big
> index tuple instead of 5 tuples, even though the logical contents of
> the page have not been changed between the time we recording heap TIDs
> in local and the time _bt_killitems() tried to match on those heap
> TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
> deduplication pass could do that. Your code needs to be prepared for
> stuff like that.
>
> Ultimately posting list tuples are just a matter of understanding the
> on-disk representation -- a "Small Matter of Programming". Even
> without deduplication there are potential hazards from the physical
> deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
> not code that runs in VACUUM, despite the name). Make sure that you
> hold a buffer pin on the leaf page throughout, because you need to do
> that to make sure that VACUUM cannot concurrently recycle heap TIDs.
> If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
> subtly broken. _bt_killitems() is safe because it either holds on to a
> pin or gives up when the LSN changes at all. (ISTM that your only
> choice is to hold on to a leaf page pin, since you cannot just decide
> to give up in the way that _bt_killitems() sometimes can.)

I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.




RE: Index Skip Scan (new UniqueKeys)

2020-07-23 Thread Floris Van Nee
> 
> One UniqueKey can have multiple corresponding expressions, which gives us
> also possibility of having one unique key with (t1.a, t2.a) and it looks now
> similar to EquivalenceClass.
> 

I believe the current definition of a unique key with two expressions (t1.a, 
t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker 
guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).

> 
> The idea behind this query sounds questionable to me, more transparent
> would be to do this without distinct, skipping will actually do exactly the 
> same
> stuff just under another name. But if allowing skipping on constants do not
> bring significant changes in the code probably it's fine.
> 

Yeah indeed, I didn't say it's a query that people should generally write. :-) 
It's better to write as a regular SELECT with LIMIT 1 of course. However, it's 
more to be consistent and predictable to the user: if a SELECT DISTINCT ON (a) 
* FROM t1 runs fast, then it doesn't make sense to the user if a SELECT 
DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And to support it also makes the 
implementation more consistent with little code changes.

> >
> > Yeah, there's definitely some double work there, but the actual impact may
> be limited - it doesn't actually allocate a new path key, but it looks it up 
> in
> root->canon_pathkeys and returns that path key.
> > I wrote it like this, because I couldn't find a way to identify from a 
> > certain
> PathKey the actual location in the index of that column. The constructed path
> keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes
> path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But
> how to get from PathKey=b to that number 3, I didn't find a solid way except
> doing this. Maybe there is though?
> 
> I don't think there is a direct way, but why not modify build_index_paths to
> also provide this information, or compare index_pathkeys expressions with
> indextlist without actually create those pathkeys again?
> 

I agree there could be other ways - I don't currently have a strong preference 
for either. I can have a look at this later.

> And couple of words about this thread [1]. It looks to me like a strange way
> of interacting with the community. Are you going to duplicate there
> everything, or what are your plans? At the very least you could try to include
> everyone involved in the recipients list, not exclude some of the authors.
> 

When I wrote the first mail in the thread, I went to this thread [1] and 
included everyone from there, but I see now that I only included the to: and 
cc: people and forgot the original thread author, you. I'm sorry about that - I 
should've looked better to make sure I had everyone.
In any case, my plan is to keep the patch at least applicable to master, as I 
believe it can be helpful for discussions about both patches.

[1] 
https://www.postgresql.org/message-id/20200609102247.jdlatmfyeecg52fi%40localhost




Re: Index Skip Scan (new UniqueKeys)

2020-07-23 Thread Dmitry Dolgov
> On Tue, Jul 14, 2020 at 06:18:50PM +, Floris Van Nee wrote:
>
> Due to the other changes I made in 
> create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan 
> now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to 
> give an actual failing test case now. However, since all code filters out 
> constant keys, I think uniqueness should do it too - otherwise you could get 
> into problems later on. It's also more consistent. If you already know 
> something is unique by just (b), it doesn't make sense to store that it's 
> unique by (a,b). Now that I think of it, the best place to do this 
> EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, 
> rather than build_uniquekeys though. It's probably good to move it there.

That would be my suggestion as well.

> > Along the lines I'm also curious about this part:
> >
> > -   ListCell   *k;
> > -   List *exprs = NIL;
> > -
> > -   foreach(k, ec->ec_members)
> > -   {
> > -   EquivalenceMember *mem = (EquivalenceMember *)
> > lfirst(k);
> > -   exprs = lappend(exprs, mem->em_expr);
> > -   }
> > -
> > -   result = lappend(result, makeUniqueKey(exprs, false, false));
> > +   EquivalenceMember *mem = (EquivalenceMember*)
> > +lfirst(list_head(ec->ec_members));
> >
> > I'm curious about this myself, maybe someone can clarify. It looks like
> > generaly speaking there could be more than one member (if not
> > ec_has_volatile), which "representing knowledge that multiple items are
> > effectively equal". Is this information is not interesting enough to 
> > preserve it
> > in unique keys?
>
> Yeah, that's a good question. Hence my question about the choice for Expr 
> rather than EquivalenceClass for the Unique Keys patch to Andy/David. When 
> storing just Expr, it is rather difficult to check equivalence in joins for 
> example. Suppose, later on we decide to add join support to the distinct skip 
> scan. Consider a query like this:
> SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
> As far as my understanding goes (I didn't look into it in detail though), I 
> think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That 
> results in a UniqueKey with expr (t1.a) (because currently we only take the 
> first Expr in the list to construct the UniqueKey). We could also construct 
> *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't 
> think that's the correct approach either, as it will explode when you have 
> multiple pathkeys, each having multiple Expr inside their EqClasses.

One UniqueKey can have multiple corresponding expressions, which gives
us also possibility of having one unique key with (t1.a, t2.a) and it
looks now similar to EquivalenceClass.

> > > - the distinct_pathkeys may be NULL, even though there's a possibility for
> > skipping. But it wouldn't create the uniquekeys in this case. This makes the
> > planner not choose skip scans even though it could. For example in queries
> > that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
> > is constant, it's eliminated from regular pathkeys.
> >
> > What would be the point of skipping if it's a constant?
>
> For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
> There may be 1000s of records with a=1. We're only interested in the first 
> one though. The traditional non-skip approach would still scan all records 
> with a=1. Skip would just fetch the first one with a=1 and then skip to the 
> next prefix and stop the scan.

The idea behind this query sounds questionable to me, more transparent
would be to do this without distinct, skipping will actually do exactly
the same stuff just under another name. But if allowing skipping on
constants do not bring significant changes in the code probably it's
fine.

> > > - to combat the issues mentioned earlier, there's now a check in
> > build_index_paths that checks if the query_pathkeys matches the
> > useful_pathkeys. Note that we have to use the path keys here rather than
> > any of the unique keys. The unique keys are only Expr nodes - they do not
> > contain the necessary information about ordering. Due to elimination of
> > some constant path keys, we have to search the attributes of the index to
> > find the correct prefix to use in skipping.
> >
> > IIUC here you mean this function, right?
> >
> > + prefix = find_index_prefix_for_pathkey(root,
> > +
> > index,
> > +
> > BackwardScanDirection,
> > +
> > llast_node(PathKey,
> > +
> > root->distinct_pathkeys));
> >
> > Doesn't it duplicate the job already done in build_index_pathkeys by 
> > building
> > those pathkeys again? If yes, probably it's possible to reuse 
> > useful_pathkeys.
> > Not sure about unordered indexes, but looks like query_pathkeys should
> > also match in this case.
> >
>
> Yeah, there's definitely some double work there, but the actual impact may be 
> limited - it doesn't actually allocate a new path key, but it looks 

Re: Index Skip Scan (new UniqueKeys)

2020-07-21 Thread Peter Geoghegan
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > > +   currItem = >currPos.items[so->currPos.lastItem];
> > > +   itup = (IndexTuple) (so->currTuples + 
> > > currItem->tupleOffset);
> > > +   nextOffset = ItemPointerGetOffsetNumber(>t_tid);
>
> Do you mean this last part with t_tid, which could also have a tid array
> in case of posting tuple format?

Yeah. There is a TID array at the end of the index when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPivot()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non high-key tuples AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)

> Well, it's obviously wrong, thanks for noticing. What is necessary is to
> compare two index tuples, the start and the next one, to test if they're
> the same (in which case if I'm not mistaken probably we can compare item
> pointers). I've got this question when I was about to post a new version
> with changes to address feedback from Andy, now I'll combine them and
> send a cumulative patch.

This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recording heap TIDs
in local and the time _bt_killitems() tried to match on those heap
TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.

Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)

Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.

Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).

-- 
Peter Geoghegan




Re: Generic Index Skip Scan

2020-07-15 Thread David Rowley
On Thu, 16 Jul 2020 at 07:52, Floris Van Nee  wrote:
>
> Besides the great efforts that Dmitry et al. are putting into the skip scan 
> for DISTINCT queries [1], I’m also still keen on extending the use of it 
> further. I’d like to address the limited cases in which skipping can occur 
> here. A few months ago I shared an initial rough patch that provided a 
> generic skip implementation, but lacked the proper planning work [2]. I’d 
> like to share a second patch set that provides an implementation of the 
> planner as well. Perhaps this can lead to some proper discussions how we’d 
> like to shape this patch further.
>
> Please see [2] for an introduction and some rough performance comparisons. 
> This patch improves upon those, because it implements proper cost estimation 
> logic. It will now only choose the skip scan if it’s deemed to be cheaper 
> than using a regular index scan. Other than that, all the features are still 
> there. The skip scan can be used in many more types of queries than in the 
> original DISTINCT patch as provided in [1], making it more performant and 
> also more predictable for users.
>
> I’m keen on receiving feedback on this idea and on the patch.

I don't think anyone ever thought the feature would be limited to just
making DISTINCT go faster. There's certain to be more usages in the
future.

However, for me it would be premature to look at this now.

David




RE: Index Skip Scan (new UniqueKeys)

2020-07-14 Thread Floris Van Nee
> 
> > - the uniquekeys that is built, still contains some redundant keys, that are
> normally eliminated from the path keys lists.
> 
> I guess you're talking about:
> 
> + if (EC_MUST_BE_REDUNDANT(ec))
> +   continue;
> 
> Can you add some test cases to your changes to show the effect of it? It
> seem to me redundant keys are already eliminated at this point by either
> make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be
> I've missed something.
> 

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks 
for uniqueness using pathkey_is_unique, but not for constantness. Consider a 
query like:
SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would 
end up with a distinct pathkeys of (b) and sort path keys of (b,c).
The unique keys would end up with (a,b) though. In the previous patch, it'd 
compared in create_distinct_paths, the pathkeys to the unique keys, so it 
wouldn't consider the skip scan.
Due to the other changes I made in 
create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan 
now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to 
give an actual failing test case now. However, since all code filters out 
constant keys, I think uniqueness should do it too - otherwise you could get 
into problems later on. It's also more consistent. If you already know 
something is unique by just (b), it doesn't make sense to store that it's 
unique by (a,b). Now that I think of it, the best place to do this 
EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, 
rather than build_uniquekeys though. It's probably good to move it there.

> Along the lines I'm also curious about this part:
> 
> - ListCell   *k;
> - List *exprs = NIL;
> -
> - foreach(k, ec->ec_members)
> - {
> - EquivalenceMember *mem = (EquivalenceMember *)
> lfirst(k);
> - exprs = lappend(exprs, mem->em_expr);
> - }
> -
> - result = lappend(result, makeUniqueKey(exprs, false, false));
> + EquivalenceMember *mem = (EquivalenceMember*)
> +lfirst(list_head(ec->ec_members));
> 
> I'm curious about this myself, maybe someone can clarify. It looks like
> generaly speaking there could be more than one member (if not
> ec_has_volatile), which "representing knowledge that multiple items are
> effectively equal". Is this information is not interesting enough to preserve 
> it
> in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr 
rather than EquivalenceClass for the Unique Keys patch to Andy/David. When 
storing just Expr, it is rather difficult to check equivalence in joins for 
example. Suppose, later on we decide to add join support to the distinct skip 
scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I 
think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results 
in a UniqueKey with expr (t1.a) (because currently we only take the first Expr 
in the list to construct the UniqueKey). We could also construct *two* unique 
keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's 
the correct approach either, as it will explode when you have multiple 
pathkeys, each having multiple Expr inside their EqClasses.
That makes it difficult to check if we can perform the DISTINCT skip scan on 
table t2 as well (theoretically we could, but for that we need to check 
equivalence classes rather than expressions).

> 
> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> skipping. But it wouldn't create the uniquekeys in this case. This makes the
> planner not choose skip scans even though it could. For example in queries
> that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
> is constant, it's eliminated from regular pathkeys.
> 
> What would be the point of skipping if it's a constant?

For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one 
though. The traditional non-skip approach would still scan all records with 
a=1. Skip would just fetch the first one with a=1 and then skip to the next 
prefix and stop the scan.

> 
> > - to combat the issues mentioned earlier, there's now a check in
> build_index_paths that checks if the query_pathkeys matches the
> useful_pathkeys. Note that we have to use the path keys here rather than
> any of the unique keys. The unique keys are only Expr nodes - they do not
> contain the necessary information about ordering. Due to elimination of
> some constant path keys, we have to search the attributes of the index to
> find the correct prefix to use in skipping.
> 
> IIUC here you mean this function, right?
> 
> + prefix = find_index_prefix_for_pathkey(root,

Re: Index Skip Scan (new UniqueKeys)

2020-07-14 Thread Dmitry Dolgov
> On Sun, Jul 12, 2020 at 12:48:47PM +, Floris Van Nee wrote:
> >
> > Good point, thanks for looking at this. With the latest planner version 
> > there
> > are indeed more possibilities to use skipping. It never occured to me that
> > some of those paths will still rely on index scan returning full data set. 
> > I'll look
> > in details and add verification to prevent putting something like this on 
> > top of
> > skip scan in the next version.
>
> I believe the required changes are something like in attached patch. There 
> were a few things I've changed:
> - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, 
> it would create two unique keys, one with a and one with b. However, it 
> should be one unique key with (a,b).

Yes, I've also noticed that while preparing fix for index scan not
covered by index and included it.

> - the uniquekeys that is built, still contains some redundant keys, that are 
> normally eliminated from the path keys lists.

I guess you're talking about:

+ if (EC_MUST_BE_REDUNDANT(ec))
+   continue;

Can you add some test cases to your changes to show the effect of it? It
seem to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but could
be I've missed something.

Along the lines I'm also curious about this part:

-   ListCell   *k;
-   List *exprs = NIL;
-
-   foreach(k, ec->ec_members)
-   {
-   EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-   exprs = lappend(exprs, mem->em_expr);
-   }
-
-   result = lappend(result, makeUniqueKey(exprs, false, false));
+   EquivalenceMember *mem = (EquivalenceMember*) 
lfirst(list_head(ec->ec_members));

I'm curious about this myself, maybe someone can clarify. It looks like
generaly speaking there could be more than one member (if not
ec_has_volatile), which "representing knowledge that multiple items are
effectively equal". Is this information is not interesting enough to
preserve it in unique keys?

> - the distinct_pathkeys may be NULL, even though there's a possibility for 
> skipping. But it wouldn't create the uniquekeys in this case. This makes the 
> planner not choose skip scans even though it could. For example in queries 
> that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is 
> constant, it's eliminated from regular pathkeys.

What would be the point of skipping if it's a constant?

> - to combat the issues mentioned earlier, there's now a check in 
> build_index_paths that checks if the query_pathkeys matches the 
> useful_pathkeys. Note that we have to use the path keys here rather than any 
> of the unique keys. The unique keys are only Expr nodes - they do not contain 
> the necessary information about ordering. Due to elimination of some constant 
> path keys, we have to search the attributes of the index to find the correct 
> prefix to use in skipping.

IIUC here you mean this function, right?

+ prefix = find_index_prefix_for_pathkey(root,
+
index,
+
BackwardScanDirection,
+
llast_node(PathKey,
+
root->distinct_pathkeys));

Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but looks like
query_pathkeys should also match in this case.

Will also look at the follow up questions in the next email.




Re: Index Skip Scan (new UniqueKeys)

2020-07-11 Thread Dmitry Dolgov
> On Fri, Jul 10, 2020 at 05:03:37PM +, Floris Van Nee wrote:
>
> Also took another look at the patch now, and found a case of incorrect
> data. It looks related to the new way of creating the paths, as I
> can't recall seeing this in earlier versions.
>
> create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 
> 10) a, generate_series(1,100) b;
> create index on t1 (a,b,c);
>
> postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
>   QUERY PLAN
> ---
>  Sort  (cost=2.92..2.94 rows=10 width=20)
>Sort Key: a, b DESC, c
>->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
>  Skip scan: true
> (4 rows)

Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occured to
me that some of those paths will still rely on index scan returning full
data set. I'll look in details and add verification to prevent putting
something like this on top of skip scan in the next version.




Re: Index Skip Scan (new UniqueKeys)

2020-07-11 Thread Dmitry Dolgov
> On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
>
> On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > * Btree-implementation contains btree specific code to implement amskip,
> >   introduced in the previous patch.
>
> The way that you're dealing with B-Tree tuples here needs to account
> for posting list tuples:
>
> > +   currItem = >currPos.items[so->currPos.lastItem];
> > +   itup = (IndexTuple) (so->currTuples + 
> > currItem->tupleOffset);
> > +   nextOffset = ItemPointerGetOffsetNumber(>t_tid);

Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?

> > +   /*
> > +* To check if we returned the same tuple, try to find a
> > +* startItup on the current page. For that we need to update
> > +* scankey to match the whole tuple and set nextkey to 
> > return
> > +* an exact tuple, not the next one. If the nextOffset is 
> > the
> > +* same as before, it means we are in the loop, return 
> > offnum
> > +* to the original position and jump further
> > +*/
>
> Why does it make sense to use the offset number like this? It isn't
> stable or reliable. The patch goes on to do this:
>
> > +   startOffset = _bt_binsrch(scan->indexRelation,
> > + so->skipScanKey,
> > + so->currPos.buf);
> > +
> > +   page = BufferGetPage(so->currPos.buf);
> > +   maxoff = PageGetMaxOffsetNumber(page);
> > +
> > +   if (nextOffset <= startOffset)
> > +   {
>
> Why compare a heap TID's offset number (an offset number for a heap
> page) to another offset number for a B-Tree leaf page? They're
> fundamentally different things.

Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.




RE: Index Skip Scan (new UniqueKeys)

2020-07-10 Thread Floris Van Nee
Hi Dmitry,

Also took another look at the patch now, and found a case of incorrect data. It 
looks related to the new way of creating the paths, as I can't recall seeing 
this in earlier versions.

create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 
10) a, generate_series(1,100) b;
create index on t1 (a,b,c);

postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
  QUERY PLAN   
---
 Sort  (cost=2.92..2.94 rows=10 width=20)
   Sort Key: a, b DESC, c
   ->  Index Scan using t1_a_b_c_idx on t1  (cost=0.28..2.75 rows=10 width=20)
 Skip scan: true
(4 rows)


With the 'order by a, b desc, c' we expect the value of column 'b' to always be 
100. With index_skipscan enabled, it always gives 1 though. It's not correct 
that the planner chooses a skip scan followed by sort in this case.

-Floris





Re: Index Skip Scan (new UniqueKeys)

2020-07-08 Thread Peter Geoghegan
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> * Btree-implementation contains btree specific code to implement amskip,
>   introduced in the previous patch.

The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:

> +   currItem = >currPos.items[so->currPos.lastItem];
> +   itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
> +   nextOffset = ItemPointerGetOffsetNumber(>t_tid);

But I wonder more generally what the idea here is. The following
comments that immediately follow provide some hints:

> +   /*
> +* To check if we returned the same tuple, try to find a
> +* startItup on the current page. For that we need to update
> +* scankey to match the whole tuple and set nextkey to return
> +* an exact tuple, not the next one. If the nextOffset is the
> +* same as before, it means we are in the loop, return offnum
> +* to the original position and jump further
> +*/

Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:

> +   startOffset = _bt_binsrch(scan->indexRelation,
> + so->skipScanKey,
> + so->currPos.buf);
> +
> +   page = BufferGetPage(so->currPos.buf);
> +   maxoff = PageGetMaxOffsetNumber(page);
> +
> +   if (nextOffset <= startOffset)
> +   {

Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.

-- 
Peter Geoghegan




Re: Index Skip Scan (new UniqueKeys)

2020-06-29 Thread Dmitry Dolgov
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
>
> I just get the rough idea of patch, looks we have to narrow down the
> user cases where we can use this method. Consider the below example:

Hi

Not exactly narrow down, but rather get rid of wrong usage of skipping
for index scan. Since skipping for it was added later than for index
only scan I can imagine there are still blind spots, so good that you've
looked. In this particular case, when index expressions do not fully
cover those expressionse result need to be distinct on, skipping just
doesn't have enough information and should not be used. I'll add it to
the next version, thanks!




Re: Index Skip Scan (new UniqueKeys)

2020-06-11 Thread Andy Fan
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Hi,
>
> Here is a new version of index skip scan patch, based on v8 patch for
> UniqueKeys implementation from [1]. I want to start a new thread to
> simplify navigation, hopefully I didn't forget anyone who actively
> participated in the discussion.
>
> To simplify reviewing I've split it into several parts:
>
> * First two are taken from [1] just for the reference and to make cfbot
> happy.
>
> * Extend-UniqueKeys consists changes that needs to be done for
>   UniqueKeys to be used in skip scan. Essentially this is a reduced
>   version of previous implementation from Jesper & David, based on the
>   new UniqueKeys infrastructure, as it does the very same thing.
>
> * Index-Skip-Scan contains not am specific code and the overall
>   infrastructure, including configuration elements and so on.
>
> * Btree-implementation contains btree specific code to implement amskip,
>   introduced in the previous patch.
>
> * The last one contains just documentation bits.
>
> Interesting enough with a new UniqueKey implementation skipping is
> applied in some tests unexpectedly. For those I've disabled
> indexskipscan to avoid confusion.
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
>

Thanks for the patch.

I just get the rough idea of patch, looks we have to narrow down the user
cases
where we can use this method.  Consider the below example:

create table j1(i int, im5 int,  im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1,
1000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;

demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-+---+
 1 |   1 | 1 |  1
(1 row)

demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
 i | im5 | im100 | im1000
---+-+---+
 1 |   1 | 0 |  0
 1 |   1 | 1 |  1
(2 rows)

drop index j1_im5_i_idx;

create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-+---+---
   1 | 1 | 0
   1 | 1 | 1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
 im5 | i | im100
-+---+---
   1 | 1 | 0
(1 row)


-- 
Best Regards
Andy Fan


Re: Index Skip Scan

2020-06-02 Thread Andy Fan
On Tue, Jun 2, 2020 at 9:38 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote:
> >
> > > Other than that to summarize current open points for future readers
> > > (this thread somehow became quite big):
> > >
> > > * Making UniqueKeys usage more generic to allow using skip scan for
> more
> > >   use cases (hopefully it was covered by the v33, but I still need a
> > >   confirmation from David, like blinking twice or something).
> > >
> > > * Suspicious performance difference between different type of workload,
> > >   mentioned by Tomas (unfortunately I had no chance yet to
> investigate).
> > >
> > > * Thinking about supporting conditions, that are not covered by the
> index,
> > >   to make skipping more flexible (one of the potential next steps in
> the
> > >   future, as suggested by Floris).
> > >
> >
> > Looks this is the latest patch,  which commit it is based on?  Thanks
>
> I have a rebased version, if you're about it. Didn't posted it yet
> mostly since I'm in the middle of adapting it to the UniqueKeys from
> other thread. Would it be ok for you to wait a bit until I'll post
> finished version?
>

Sure, that's OK.  The discussion on UniqueKey thread looks more complex
than what I expected, that's why I want to check the code here, but that's
fine,
you can work on your schedule.

-- 
Best Regards
Andy Fan


Re: Index Skip Scan

2020-06-02 Thread Dmitry Dolgov
> On Tue, Jun 02, 2020 at 08:36:31PM +0800, Andy Fan wrote:
>
> > Other than that to summarize current open points for future readers
> > (this thread somehow became quite big):
> >
> > * Making UniqueKeys usage more generic to allow using skip scan for more
> >   use cases (hopefully it was covered by the v33, but I still need a
> >   confirmation from David, like blinking twice or something).
> >
> > * Suspicious performance difference between different type of workload,
> >   mentioned by Tomas (unfortunately I had no chance yet to investigate).
> >
> > * Thinking about supporting conditions, that are not covered by the index,
> >   to make skipping more flexible (one of the potential next steps in the
> >   future, as suggested by Floris).
> >
>
> Looks this is the latest patch,  which commit it is based on?  Thanks

I have a rebased version, if you're about it. Didn't posted it yet
mostly since I'm in the middle of adapting it to the UniqueKeys from
other thread. Would it be ok for you to wait a bit until I'll post
finished version?




Re: Index Skip Scan

2020-06-02 Thread Andy Fan
On Wed, Apr 8, 2020 at 3:41 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Mon, Apr 06, 2020 at 06:31:08PM +, Floris Van Nee wrote:
> >
> > > Hm, I wasn't aware about this one, thanks for bringing this up. Btw,
> Floris, I
> > > would appreciate if in the future you can make it more visible that
> changes you
> > > suggest contain some fixes. E.g. it wasn't clear for me from your
> previous email
> > > that that's the case, and it doesn't make sense to pull into different
> direction
> > > when we're trying to achieve the same goal :)
> >
> > I wasn't aware that this particular case could be triggered before I saw
> Dilip's email, otherwise I'd have mentioned it here of course. It's just
> that because my patch handles filter conditions in general, it works for
> this case too.
>
> Oh, then fortunately I've got a wrong impression, sorry and thanks for
> clarification :)
>
> > > > > In the patch I posted a week ago these cases are all handled
> > > > > correctly, as it introduces this extra logic in the Executor.
> > > >
> > > > Okay, So I think we can merge those fixes in Dmitry's patch set.
> > >
> > > I'll definitely take a look at suggested changes in filtering part.
> >
> > It may be possible to just merge the filtering part into your patch, but
> I'm not entirely sure. Basically you have to pull the information about
> skipping one level up, out of the node, into the generic IndexNext code.
>
> I was actually thinking more about just preventing skip scan in this
> situation, which is if I'm not mistaken could be solved by inspecting
> qual conditions to figure out if they're covered in the index -
> something like in attachments (this implementation is actually too
> restrictive in this sense and will not allow e.g. expressions, that's
> why I haven't bumped patch set version for it - soon I'll post an
> extended version).
>
> Other than that to summarize current open points for future readers
> (this thread somehow became quite big):
>
> * Making UniqueKeys usage more generic to allow using skip scan for more
>   use cases (hopefully it was covered by the v33, but I still need a
>   confirmation from David, like blinking twice or something).
>
> * Suspicious performance difference between different type of workload,
>   mentioned by Tomas (unfortunately I had no chance yet to investigate).
>
> * Thinking about supporting conditions, that are not covered by the index,
>   to make skipping more flexible (one of the potential next steps in the
>   future, as suggested by Floris).
>

Looks this is the latest patch,  which commit it is based on?  Thanks

-- 
Best Regards
Andy Fan


Re: Index Skip Scan

2020-05-15 Thread Dilip Kumar
On Fri, 15 May 2020 at 6:06 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
> >
> > + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf,
> dir))
> > + {
> >
> > Here we expect whether the "next" unique key can be found on this page
> > or not, but we are using the function which suggested whether the
> > "current" key can be found on this page or not.  I think in boundary
> > cases where the high key is equal to the current key, this function
> > will return true (which is expected from this function), and based on
> > that we will simply scan the current page and IMHO that cost could be
> > avoided no?
>
> Yes, looks like you're right, there is indeed an unecessary extra scan
> happening. To avoid that we can see the key->nextkey and adjust higher
> boundary correspondingly. Will also add this into the next rebased
> patch, thanks!


Great thanks!

> --
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Index Skip Scan

2020-05-15 Thread Dmitry Dolgov
> On Wed, May 13, 2020 at 02:37:21PM +0530, Dilip Kumar wrote:
>
> + if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
> + {
>
> Here we expect whether the "next" unique key can be found on this page
> or not, but we are using the function which suggested whether the
> "current" key can be found on this page or not.  I think in boundary
> cases where the high key is equal to the current key, this function
> will return true (which is expected from this function), and based on
> that we will simply scan the current page and IMHO that cost could be
> avoided no?

Yes, looks like you're right, there is indeed an unecessary extra scan
happening. To avoid that we can see the key->nextkey and adjust higher
boundary correspondingly. Will also add this into the next rebased
patch, thanks!




Re: Index Skip Scan

2020-05-13 Thread Peter Geoghegan
On Mon, Jan 20, 2020 at 5:05 PM Peter Geoghegan  wrote:
> You can add another assertion that calls a new utility function in
> bufmgr.c. That can use the same logic as this existing assertion in
> FlushOneBuffer():
>
> Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
>
> We haven't needed assertions like this so far because it's usually it
> is clear whether or not a buffer lock is held (plus the bufmgr.c
> assertions help on their own).

Just in case anybody missed it, I am working on a patch that makes
nbtree use Valgrind instrumentation to detect page accessed without a
buffer content lock held:

https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpwrub_45eexlh1+k...@mail.gmail.com

There is also one component that detects when any buffer is accessed
without a buffer pin.

-- 
Peter Geoghegan




Re: Index Skip Scan

2020-05-13 Thread Dilip Kumar
On Mon, May 11, 2020 at 4:55 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
> >
> > > > +static inline bool
> > > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > > + Buffer buf, ScanDirection dir)
> > > > +{
> > > > + OffsetNumber low, high;
> > > > + Page page = BufferGetPage(buf);
> > > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > > +
> > > > + low = P_FIRSTDATAKEY(opaque);
> > > > + high = PageGetMaxOffsetNumber(page);
> > > > +
> > > > + if (unlikely(high < low))
> > > > + return false;
> > > > +
> > > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > > +}
> > > >
> > > > I think the high key condition should be changed to
> > > > _bt_compare(scan->indexRelation, key, page, high) < 0 ?  Because if
> > > > prefix qual is equal to the high key then also there is no point in
> > > > searching on the current page so we can directly skip.
> > >
> > > From nbtree/README and comments to functions like _bt_split I've got an
> > > impression that the high key could be equal to the last item on the leaf
> > > page, so there is a point in searching. Is that incorrect?
> >
> > But IIUC,  here we want to decide whether we will get the next key in
> > the current page or not?
>
> In general this function does what it says, it checks wether or not the
> provided scankey could be found within the page. All the logic about
> finding a proper next key to fetch is implemented on the call site, and
> within this function we want to test whatever was passed in. Does it
> answer the question?

Ok, I agree that the function is doing what it is expected to do.
But, then I have a problem with this call site.

+ /* Check if the next unique key can be found within the current page.
+ * Since we do not lock the current page between jumps, it's possible
+ * that it was splitted since the last time we saw it. This is fine in
+ * case of scanning forward, since page split to the right and we are
+ * still on the left most page. In case of scanning backwards it's
+ * possible to loose some pages and we need to remember the previous
+ * page, and then follow the right link from the current page until we
+ * find the original one.
+ *
+ * Since the whole idea of checking the current page is to protect
+ * ourselves and make more performant statistic mismatch case when
+ * there are too many distinct values for jumping, it's not clear if
+ * the complexity of this solution in case of backward scan is
+ * justified, so for now just avoid it.
+ */
+ if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir))
+ {
+ LockBuffer(so->currPos.buf, BT_READ);
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))
+ {

Here we expect whether the "next" unique key can be found on this page
or not, but we are using the function which suggested whether the
"current" key can be found on this page or not.  I think in boundary
cases where the high key is equal to the current key, this function
will return true (which is expected from this function), and based on
that we will simply scan the current page and IMHO that cost could be
avoided no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-05-11 Thread Dmitry Dolgov
> On Mon, May 11, 2020 at 04:04:00PM +0530, Dilip Kumar wrote:
>
> > > +static inline bool
> > > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > > + Buffer buf, ScanDirection dir)
> > > +{
> > > + OffsetNumber low, high;
> > > + Page page = BufferGetPage(buf);
> > > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > > +
> > > + low = P_FIRSTDATAKEY(opaque);
> > > + high = PageGetMaxOffsetNumber(page);
> > > +
> > > + if (unlikely(high < low))
> > > + return false;
> > > +
> > > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > > +}
> > >
> > > I think the high key condition should be changed to
> > > _bt_compare(scan->indexRelation, key, page, high) < 0 ?  Because if
> > > prefix qual is equal to the high key then also there is no point in
> > > searching on the current page so we can directly skip.
> >
> > From nbtree/README and comments to functions like _bt_split I've got an
> > impression that the high key could be equal to the last item on the leaf
> > page, so there is a point in searching. Is that incorrect?
>
> But IIUC,  here we want to decide whether we will get the next key in
> the current page or not?

In general this function does what it says, it checks wether or not the
provided scankey could be found within the page. All the logic about
finding a proper next key to fetch is implemented on the call site, and
within this function we want to test whatever was passed in. Does it
answer the question?




Re: Index Skip Scan

2020-05-11 Thread Dilip Kumar
On Sun, May 10, 2020 at 11:17 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
> >
> > Some more comments...
>
> Thanks for reviewing. Since this patch took much longer than I expected,
> it's useful to have someone to look at it with a "fresh eyes".
>
> > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> > + _bt_update_skip_scankeys(scan, indexRel);
> > +
> > ...
> > + /*
> > + * We haven't found scan key within the current page, so let's scan from
> > + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> > + * number
> > + */
> > + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> > + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> > +, BT_READ, scan->xs_snapshot);
> >
> > Why do we need to set so->skipScanKey->nextkey =
> > ScanDirectionIsForward(dir); multiple times? I think it is fine to
> > just set it once?
>
> I believe it was necessary for previous implementations, but in the
> current version we can avoid this, you're right.
>
> > +static inline bool
> > +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> > + Buffer buf, ScanDirection dir)
> > +{
> > + OffsetNumber low, high;
> > + Page page = BufferGetPage(buf);
> > + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> > +
> > + low = P_FIRSTDATAKEY(opaque);
> > + high = PageGetMaxOffsetNumber(page);
> > +
> > + if (unlikely(high < low))
> > + return false;
> > +
> > + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> > + _bt_compare(scan->indexRelation, key, page, high) < 1);
> > +}
> >
> > I think the high key condition should be changed to
> > _bt_compare(scan->indexRelation, key, page, high) < 0 ?  Because if
> > prefix qual is equal to the high key then also there is no point in
> > searching on the current page so we can directly skip.
>
> From nbtree/README and comments to functions like _bt_split I've got an
> impression that the high key could be equal to the last item on the leaf
> page, so there is a point in searching. Is that incorrect?

But IIUC,  here we want to decide whether we will get the next key in
the current page or not? Is my understanding is correct?   So if our
key (the last tuple key) is equal to the high key means the max key on
this page is the same as what we already got in the last tuple so why
would we want to go on this page? because this will not give us the
new key.  So ideally, we should only be looking into this page if our
last tuple key is smaller than the high key.  Am I missing something?

>
> > + /* Check if an index skip scan is possible. */
> > + can_skip = enable_indexskipscan & index->amcanskip;
> > +
> > + /*
> > + * Skip scan is not supported when there are qual conditions, which are not
> > + * covered by index. The reason for that is that those conditions are
> > + * evaluated later, already after skipping was applied.
> > + *
> > + * TODO: This implementation is too restrictive, and doesn't allow e.g.
> > + * index expressions. For that we need to examine index_clauses too.
> > + */
> > + if (root->parse->jointree != NULL)
> > + {
> > + ListCell *lc;
> > +
> > + foreach(lc, (List *)root->parse->jointree->quals)
> > + {
> > + Node *expr, *qual = (Node *) lfirst(lc);
> > + Var *var;
> >
> > I think we can avoid checking for expression if can_skip is already false.
>
> Yes, that makes sense. I'll include your suggestions into the next
> rebased version I'm preparing.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-05-10 Thread Dmitry Dolgov
> On Sat, Apr 11, 2020 at 03:17:25PM +0530, Dilip Kumar wrote:
>
> Some more comments...

Thanks for reviewing. Since this patch took much longer than I expected,
it's useful to have someone to look at it with a "fresh eyes".

> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + _bt_update_skip_scankeys(scan, indexRel);
> +
> ...
> + /*
> + * We haven't found scan key within the current page, so let's scan from
> + * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
> + * number
> + */
> + so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
> + stack = _bt_search(scan->indexRelation, so->skipScanKey,
> +, BT_READ, scan->xs_snapshot);
>
> Why do we need to set so->skipScanKey->nextkey =
> ScanDirectionIsForward(dir); multiple times? I think it is fine to
> just set it once?

I believe it was necessary for previous implementations, but in the
current version we can avoid this, you're right.

> +static inline bool
> +_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
> + Buffer buf, ScanDirection dir)
> +{
> + OffsetNumber low, high;
> + Page page = BufferGetPage(buf);
> + BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> +
> + low = P_FIRSTDATAKEY(opaque);
> + high = PageGetMaxOffsetNumber(page);
> +
> + if (unlikely(high < low))
> + return false;
> +
> + return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
> + _bt_compare(scan->indexRelation, key, page, high) < 1);
> +}
>
> I think the high key condition should be changed to
> _bt_compare(scan->indexRelation, key, page, high) < 0 ?  Because if
> prefix qual is equal to the high key then also there is no point in
> searching on the current page so we can directly skip.

>From nbtree/README and comments to functions like _bt_split I've got an
impression that the high key could be equal to the last item on the leaf
page, so there is a point in searching. Is that incorrect?

> + /* Check if an index skip scan is possible. */
> + can_skip = enable_indexskipscan & index->amcanskip;
> +
> + /*
> + * Skip scan is not supported when there are qual conditions, which are not
> + * covered by index. The reason for that is that those conditions are
> + * evaluated later, already after skipping was applied.
> + *
> + * TODO: This implementation is too restrictive, and doesn't allow e.g.
> + * index expressions. For that we need to examine index_clauses too.
> + */
> + if (root->parse->jointree != NULL)
> + {
> + ListCell *lc;
> +
> + foreach(lc, (List *)root->parse->jointree->quals)
> + {
> + Node *expr, *qual = (Node *) lfirst(lc);
> + Var *var;
>
> I think we can avoid checking for expression if can_skip is already false.

Yes, that makes sense. I'll include your suggestions into the next
rebased version I'm preparing.




Re: Index Skip Scan

2020-05-10 Thread Dmitry Dolgov
Sorry for late reply.

> On Tue, Apr 14, 2020 at 09:19:22PM +1200, David Rowley wrote:
>
> I've not yet looked at the latest patch, but I did put some thoughts
> into an email on the other thread that's been discussing UniqueKeys
> [1].
>
> I'm keen to hear thoughts on the plan I mentioned over there.  Likely
> it would be best to discuss the specifics of what additional features
> we need to add to UniqueKeys for skip scans over here, but discuss any
> chances which affect both patches over there.  We certainly can't have
> two separate implementations of UniqueKeys, so I believe the skip
> scans UniqueKeys patch should most likely be based on the one in [1]
> or some descendant of it.
>
> [1] 
> https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com

Yes, I've come to the same conclusion, although I have my concerns about
having such a dependency between patches. Will look at the suggested
patches soon.




Re: Index Skip Scan

2020-04-14 Thread David Rowley
On Wed, 8 Apr 2020 at 07:40, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Other than that to summarize current open points for future readers
> (this thread somehow became quite big):
>
> * Making UniqueKeys usage more generic to allow using skip scan for more
>   use cases (hopefully it was covered by the v33, but I still need a
>   confirmation from David, like blinking twice or something).

I've not yet looked at the latest patch, but I did put some thoughts
into an email on the other thread that's been discussing UniqueKeys
[1].

I'm keen to hear thoughts on the plan I mentioned over there.  Likely
it would be best to discuss the specifics of what additional features
we need to add to UniqueKeys for skip scans over here, but discuss any
chances which affect both patches over there.  We certainly can't have
two separate implementations of UniqueKeys, so I believe the skip
scans UniqueKeys patch should most likely be based on the one in [1]
or some descendant of it.

[1] 
https://www.postgresql.org/message-id/CAApHDvpx1qED1uLqubcKJ=ohatcmd7ptukkdq0b72_08nbr...@mail.gmail.com




RE: Index Skip Scan

2020-04-07 Thread Floris Van Nee
> 
> * Suspicious performance difference between different type of workload,
>   mentioned by Tomas (unfortunately I had no chance yet to investigate).
> 

His benchmark results indeed most likely point to multiple comparisons being 
done. Since the most likely place where these occur is _bt_readpage, I suspect 
this is called multiple times. Looking at your patch, I think that's indeed the 
case. For example, suppose a page contains [1,2,3,4,5] and the planner makes a 
complete misestimation and chooses a skip scan here. First call to _bt_readpage 
will compare every tuple on the page already and store everything in the 
workspace, which will now contain [1,2,3,4,5]. However, when a skip is done the 
elements on the page (not the workspace) are compared to find the next one. 
Then, another _bt_readpage is done, starting at the new offnum. So we'll 
compare every tuple (except 1) on the page again. Workspace now contains 
[2,3,4,5]. Next tuple we'll end up with [3,4,5] etc. So tuple 5 actually gets 
compared 5 times in _bt_readpage alone.

-Floris





Re: Index Skip Scan

2020-04-07 Thread Dmitry Dolgov
ove), describing the sort
  * ordering of the path's output rows.
+ *
+ * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing the XXX.
  */
 typedef struct Path
 {
@@ -1129,6 +1146,8 @@ typedef struct Path
 
 	List	   *pathkeys;		/* sort ordering of path's output */
 	/* pathkeys is a List of PathKey nodes; see above */
+
+	List	   *uniquekeys;	/* the unique keys, or NIL if none */
 } Path;
 
 /* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index 6126b491bf..006248bfb5 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump);
 extern void print_rt(const List *rtable);
 extern void print_expr(const Node *expr, const List *rtable);
 extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_uniquekeys(const List *uniquekeys, const List *rtable);
 extern void print_tl(const List *tlist, const List *rtable);
 extern void print_slot(TupleTableSlot *slot);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..fd25997af5 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -34,6 +34,7 @@ extern void add_partial_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_partial_path_precheck(RelOptInfo *parent_rel,
 	  Cost total_cost, List *pathkeys);
 
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
 extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
  Relids required_outer, int parallel_workers);
 extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel,
@@ -44,6 +45,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 	List *indexorderbys,
 	List *indexorderbycols,
 	List *pathkeys,
+	List *uniquekeys,
 	ScanDirection indexscandir,
 	bool indexonly,
 	Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..5b6be383b3 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
 extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
 		   List *sortclauses,
 		   List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+		  List *sortclauses,
+		  List *tlist);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 			RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	List *live_childrels);
 
+/*
+ * uniquekey.c
+ *	  Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
 #endif			/* PATHS_H */
-- 
2.21.0

>From a44f8bf868133f0a342bf8de477d650a455efac7 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH v33 2/2] Index skip scan

Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable for both
situations when there are small number of distinct values and
significant amount of distinct values the following approach is taken -
instead of searching from the root for every value we're searching for
then first on the current page, and then if not found continue searching
from the root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  15 +
 doc/src/sgml/indexam.sgml |  63 ++
 doc/src/sgml/indices.sgml |  23 +
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  18 +
 src/backend/access/nbtree/nbtree.c|  13 +
 src/backend/access/nbtree/nbtsearch.c | 469 -
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/commands/explain.c

RE: Index Skip Scan

2020-04-06 Thread Floris Van Nee


> 
> Hm, I wasn't aware about this one, thanks for bringing this up. Btw, Floris, I
> would appreciate if in the future you can make it more visible that changes 
> you
> suggest contain some fixes. E.g. it wasn't clear for me from your previous 
> email
> that that's the case, and it doesn't make sense to pull into different 
> direction
> when we're trying to achieve the same goal :)

I wasn't aware that this particular case could be triggered before I saw 
Dilip's email, otherwise I'd have mentioned it here of course. It's just that 
because my patch handles filter conditions in general, it works for this case 
too.

> 
> > > In the patch I posted a week ago these cases are all handled
> > > correctly, as it introduces this extra logic in the Executor.
> >
> > Okay, So I think we can merge those fixes in Dmitry's patch set.
> 
> I'll definitely take a look at suggested changes in filtering part.

It may be possible to just merge the filtering part into your patch, but I'm 
not entirely sure. Basically you have to pull the information about skipping 
one level up, out of the node, into the generic IndexNext code. 

I'm eager to get some form of skip scans into master - any kind of patch that 
makes this possible is fine by me. Long term I think my version provides a more 
generic approach, with which we can optimize a much broader range of queries. 
However, since many more eyes have seen your patch so far, I hope yours can be 
committed much sooner. My knowledge on this committer process is limited 
though. That's why I've just posted mine so far in the hope of collecting some 
feedback, also on how we should continue with the process.

-Floris






Re: Index Skip Scan

2020-04-06 Thread Dmitry Dolgov
> On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee  
> wrote:
> >
> > There's a number of ways we can force a 'filter' rather than an
> > 'index condition'.

Hm, I wasn't aware about this one, thanks for bringing this up. Btw,
Floris, I would appreciate if in the future you can make it more visible
that changes you suggest contain some fixes. E.g. it wasn't clear for me
from your previous email that that's the case, and it doesn't make sense
to pull into different direction when we're trying to achieve the same
goal :)

> > In the patch I posted a week ago these cases are all handled
> > correctly, as it introduces this extra logic in the Executor.
>
> Okay, So I think we can merge those fixes in Dmitry's patch set.

I'll definitely take a look at suggested changes in filtering part.




Re: Index Skip Scan

2020-04-06 Thread Dilip Kumar
On Mon, Apr 6, 2020 at 1:14 PM Floris Van Nee  wrote:
>
> > > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > > >
> > > > I was just wondering how the distinct will work with the "skip scan"
> > > > if we have some filter? I mean every time we select the unique row
> > > > based on the prefix key and that might get rejected by an external
> > > > filter right?
> > >
>
> Yeah, you're correct. This patch only handles the index conditions and 
> doesn't handle any filters correctly. There's a check in the planner for the 
> IndexScan for example that only columns that exist in the index are used. 
> However, this check is not sufficient as your example shows. There's a number 
> of ways we can force a 'filter' rather than an 'index condition' and still 
> choose a skip scan (WHERE b!=0 is another one I think). This leads to 
> incorrect query results.

Right

> This patch would need some logic in the planner to never choose the skip scan 
> in these cases. Better long-term solution is to adapt the rest of the 
> executor to work correctly in the cases of external filters (this ties in 
> with the previous visibility discussion as well, as that's basically also an 
> external filter, although a special case).

I agree

> In the patch I posted a week ago these cases are all handled correctly, as it 
> introduces this extra logic in the Executor.

Okay, So I think we can merge those fixes in Dmitry's patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: Index Skip Scan

2020-04-06 Thread Floris Van Nee
> > > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> > >
> > > I was just wondering how the distinct will work with the "skip scan"
> > > if we have some filter? I mean every time we select the unique row
> > > based on the prefix key and that might get rejected by an external
> > > filter right?
> >

Yeah, you're correct. This patch only handles the index conditions and doesn't 
handle any filters correctly. There's a check in the planner for the IndexScan 
for example that only columns that exist in the index are used. However, this 
check is not sufficient as your example shows. There's a number of ways we can 
force a 'filter' rather than an 'index condition' and still choose a skip scan 
(WHERE b!=0 is another one I think). This leads to incorrect query results.

This patch would need some logic in the planner to never choose the skip scan 
in these cases. Better long-term solution is to adapt the rest of the executor 
to work correctly in the cases of external filters (this ties in with the 
previous visibility discussion as well, as that's basically also an external 
filter, although a special case).
In the patch I posted a week ago these cases are all handled correctly, as it 
introduces this extra logic in the Executor.

-Floris




Re: Index Skip Scan

2020-04-05 Thread Dilip Kumar
On Sun, Apr 5, 2020 at 9:39 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
> >
> > I was just wondering how the distinct will work with the "skip scan"
> > if we have some filter? I mean every time we select the unique row
> > based on the prefix key and that might get rejected by an external
> > filter right?
>
> Not exactly. In the case of index-only scan, we skipping to the first
> unique position, and then use already existing functionality
> (_bt_readpage with stepping to the next pages) to filter out those
> records that do not pass the condition.

I agree but that will work if we have a valid index clause, but
"b%100=0" condition will not create an index clause, right?  However,
if we change the query to
select distinct(a) from t where b=100 then it works fine because this
condition will create an index clause.

 There are even couple of tests
> in the patch for this. In case of index scan, when there are some
> conditions, current implementation do not consider skipping.
>
> > So I tried an example to check this.
>
> Can you tell on which version of the patch you were testing?

I have tested on v33.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-04-05 Thread Dmitry Dolgov
> On Sun, Apr 05, 2020 at 04:30:51PM +0530, Dilip Kumar wrote:
>
> I was just wondering how the distinct will work with the "skip scan"
> if we have some filter? I mean every time we select the unique row
> based on the prefix key and that might get rejected by an external
> filter right?

Not exactly. In the case of index-only scan, we skipping to the first
unique position, and then use already existing functionality
(_bt_readpage with stepping to the next pages) to filter out those
records that do not pass the condition. There are even couple of tests
in the patch for this. In case of index scan, when there are some
conditions, current implementation do not consider skipping.

> So I tried an example to check this.

Can you tell on which version of the patch you were testing?




Re: Index Skip Scan

2020-04-05 Thread Dilip Kumar
On Wed, Mar 25, 2020 at 2:19 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
> >
> > Seems like you forgot to add the uniquekey.c file in the
> > v33-0001-Unique-key.patch.
>
> Oh, you're right, thanks. Here is the corrected patch.

I was just wondering how the distinct will work with the "skip scan"
if we have some filter? I mean every time we select the unique row
based on the prefix key and that might get rejected by an external
filter right?  So I tried an example to check this.

postgres[50006]=# \d t
 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
 b  | integer |   |  |
Indexes:
"idx" btree (a, b)

postgres[50006]=# insert into t select 2, i from generate_series(1, 200)i;
INSERT 0 200
postgres[50006]=# insert into t select 1, i from generate_series(1, 200)i;
INSERT 0 200

postgres[50006]=# set enable_indexskipscan =off;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
 1
 2
(2 rows)

postgres[50006]=# set enable_indexskipscan =on;
SET
postgres[50006]=# select distinct(a) from t where b%100=0;
 a
---
(0 rows)

postgres[50006]=# explain select distinct(a) from t where b%100=0;
QUERY PLAN
---
 Index Only Scan using idx on t  (cost=0.15..1.55 rows=10 width=4)
   Skip scan: true
   Filter: ((b % 100) = 0)
(3 rows)

I think in such cases we should not select the skip scan.  This should
behave like we have a filter on the non-index field.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-03-25 Thread Dmitry Dolgov
*root,
 extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
 		   List *sortclauses,
 		   List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+		  List *sortclauses,
+		  List *tlist);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 			RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	List *live_childrels);
 
+/*
+ * uniquekey.c
+ *	  Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
 #endif			/* PATHS_H */
-- 
2.21.0

>From 666a8095b700365f78de80fa0febc4a0ac24ae7a Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH v33 2/2] Index skip scan

Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable for both
situations when there are small number of distinct values and
significant amount of distinct values the following approach is taken -
instead of searching from the root for every value we're searching for
then first on the current page, and then if not found continue searching
from the root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  15 +
 doc/src/sgml/indexam.sgml |  63 ++
 doc/src/sgml/indices.sgml |  23 +
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  18 +
 src/backend/access/nbtree/nbtree.c|  13 +
 src/backend/access/nbtree/nbtsearch.c | 469 -
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/commands/explain.c|  25 +
 src/backend/executor/nodeIndexonlyscan.c  |  97 ++-
 src/backend/executor/nodeIndexscan.c  |  56 +-
 src/backend/nodes/copyfuncs.c |   2 +
 src/backend/nodes/outfuncs.c  |   2 +
 src/backend/nodes/readfuncs.c |   2 +
 src/backend/optimizer/path/costsize.c |   1 +
 src/backend/optimizer/path/indxpath.c |  37 ++
 src/backend/optimizer/plan/createplan.c   |  20 +-
 src/backend/optimizer/plan/planner.c  |  10 +-
 src/backend/optimizer/util/pathnode.c |  68 ++
 src/backend/optimizer/util/plancat.c  |   1 +
 src/backend/utils/misc/guc.c  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/amapi.h|   8 +
 src/include/access/genam.h|   2 +
 src/include/access/nbtree.h   |   7 +
 src/include/access/sdir.h |   7 +
 src/include/nodes/execnodes.h |   6 +
 src/include/nodes/pathnodes.h |   5 +
 src/include/nodes/plannodes.h |   4 +
 src/include/optimizer/cost.h  |   1 +
 src/include/optimizer/pathnode.h  |   3 +
 src/test/regress/expected/select_distinct.out | 621 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/select_distinct.sql  | 254 +++
 38 files changed, 1845 insertions(+), 14 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 0104d02f67..a018b7f3d0 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -133,6 +133,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e80..36ba75b077 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4517,6 +4517,21 @@ ANY num_sync ( 
+  enable_indexskipscan (boolean)
+  
+   enable_indexskipscan configuration parameter
+  
+  
+  
+   
+Enables 

Re: Index Skip Scan

2020-03-25 Thread Dilip Kumar
On Tue, Mar 24, 2020 at 10:08 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
> >
> > Yes, I was complaining that a ProjectionPath breaks the optimisation
> > and I don't believe there's any reason that it should.
> >
> > I believe the way to make that work correctly requires paying
> > attention to the Path's uniquekeys rather than what type of path it
> > is.
>
> Thanks for the suggestion. As a result of the discussion I've modified
> the patch, does it look similar to what you had in mind?
>
> In this version if all conditions are met and there are corresponding
> unique keys, a new index skip scan path will be added to
> unique_pathlists. In case if requested distinct clauses match with
> unique keys, create_distinct_paths can choose this path without needen
> to know what kind of path is it. Also unique_keys are passed through
> ProjectionPath, so optimization for the example mentioned in this thread
> before now should work (I've added one test for that).
>
> I haven't changed anything about UniqueKey structure itself (one of the
> suggestions was about Expr instead of EquivalenceClass), but I believe
> we need anyway to figure out how two existing imlementation (in this
> patch and from [1]) of this idea can be connected.
>
> [1]: 
> https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

---
 src/backend/nodes/outfuncs.c  | 14 ++
 src/backend/nodes/print.c | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41 
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h |  1 +
 src/include/nodes/pathnodes.h | 19 +++
 src/include/nodes/print.h |  1 +
 src/include/optimizer/pathnode.h  |  2 +
 src/include/optimizer/paths.h | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

Seems like you forgot to add the uniquekey.c file in the
v33-0001-Unique-key.patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Index Skip Scan

2020-03-24 Thread Andy Fan
On Wed, Mar 25, 2020 at 12:41 AM Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote:
> >
> > There was a dedicated thread  [1]  where David explain his idea very
> > detailed, and you can also check that messages around that message for
> > the context. hope it helps.
>
> Thanks for pointing out to this thread! Somehow I've missed it, and now
> looks like we need to make some efforts to align patches for index skip
> scan and distincClause elimination.
>

Yes:).   Looks Index skip scan is a way of make a distinct result without a
real
distinct node,  which happens after the UniqueKeys check where I try to see
if
the result is unique already and before the place where create a unique node
for distinct node(With index skip scan we don't need it all).  Currently in
my patch,
the logical here is 1).  Check the UniqueKey to see if the result is not
unique already.
if not, go to next 2).  After the distinct paths are created,  I will add
the result of distinct
path as a unique key.   Will you add the index skip scan path during
create_distincts_paths
and add the UniqueKey to RelOptInfo?  if so I guess my current patch can
handle it since
it cares about the result of distinct path but no worried about how we
archive that.


Best Regards
Andy Fan


Re: Index Skip Scan

2020-03-24 Thread Dmitry Dolgov
> On Wed, Mar 11, 2020 at 06:56:09PM +0800, Andy Fan wrote:
>
> There was a dedicated thread  [1]  where David explain his idea very
> detailed, and you can also check that messages around that message for
> the context. hope it helps.

Thanks for pointing out to this thread! Somehow I've missed it, and now
looks like we need to make some efforts to align patches for index skip
scan and distincClause elimination.




Re: Index Skip Scan

2020-03-24 Thread Dmitry Dolgov
> On Wed, Mar 11, 2020 at 11:17:51AM +1300, David Rowley wrote:
>
> Yes, I was complaining that a ProjectionPath breaks the optimisation
> and I don't believe there's any reason that it should.
>
> I believe the way to make that work correctly requires paying
> attention to the Path's uniquekeys rather than what type of path it
> is.

Thanks for the suggestion. As a result of the discussion I've modified
the patch, does it look similar to what you had in mind?

In this version if all conditions are met and there are corresponding
unique keys, a new index skip scan path will be added to
unique_pathlists. In case if requested distinct clauses match with
unique keys, create_distinct_paths can choose this path without needen
to know what kind of path is it. Also unique_keys are passed through
ProjectionPath, so optimization for the example mentioned in this thread
before now should work (I've added one test for that).

I haven't changed anything about UniqueKey structure itself (one of the
suggestions was about Expr instead of EquivalenceClass), but I believe
we need anyway to figure out how two existing imlementation (in this
patch and from [1]) of this idea can be connected.

[1]: 
https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com
>From c7af8157da82db1cedf02e6ec0de355b56275680 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c  | 14 ++
 src/backend/nodes/print.c | 39 +++
 src/backend/optimizer/path/Makefile   |  3 +-
 src/backend/optimizer/path/allpaths.c |  8 +++
 src/backend/optimizer/path/indxpath.c | 41 
 src/backend/optimizer/path/pathkeys.c | 71 ++-
 src/backend/optimizer/plan/planagg.c  |  1 +
 src/backend/optimizer/plan/planmain.c |  1 +
 src/backend/optimizer/plan/planner.c  | 37 +-
 src/backend/optimizer/util/pathnode.c | 46 +
 src/include/nodes/nodes.h |  1 +
 src/include/nodes/pathnodes.h | 19 +++
 src/include/nodes/print.h |  1 +
 src/include/optimizer/pathnode.h  |  2 +
 src/include/optimizer/paths.h | 11 +
 15 files changed, 272 insertions(+), 23 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 _outPathKey(str, obj);
 break;
+			case T_UniqueKey:
+_outUniqueKey(str, obj);
+break;
 			case T_PathTarget:
 _outPathTarget(str, obj);
 break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(

Re: Index Skip Scan

2020-03-23 Thread Andy Fan
>
>
> On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee 
> wrote:
> > I'm unsure which version number to give this patch (to continue with
> numbers from previous skip scan patches, or to start numbering from scratch
> again). It's a rather big change, so one could argue it's mostly a separate
> patch. I guess it mostly depends on how close the original versions were to
> be committable. Thoughts?
>
> I don't know, but from the sidelines, it'd be nice to see the unique
> path part go into PG13, where IIUC it can power the "useless unique
> removal" patch.
>

Actually I have a patch to remove the distinct clause some long time ago[1],
and later it came to the UniqueKey as well,  you can see [2] for the current
status.

[1]
https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107

[2]
https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL=uafudmf4wgovkel3onbatqju9nsxtuucpp...@mail.gmail.com


Re: Index Skip Scan

2020-03-22 Thread Thomas Munro
Hi Floris,

On Sun, Mar 22, 2020 at 11:00 AM Floris Van Nee
 wrote:
> create index on t1 (a,b,c);

> select * from t1 where b in (100, 200);
>  Execution Time: 2.464 ms
>  Execution Time: 252.224 ms
>  Execution Time: 244.872 ms

Wow.  This is very cool work and I'm sure it will become a major
headline feature of PG14 if the requisite planner brains can be sorted
out.

On Mon, Mar 23, 2020 at 1:55 AM Floris Van Nee  wrote:
> I'm unsure which version number to give this patch (to continue with numbers 
> from previous skip scan patches, or to start numbering from scratch again). 
> It's a rather big change, so one could argue it's mostly a separate patch. I 
> guess it mostly depends on how close the original versions were to be 
> committable. Thoughts?

I don't know, but from the sidelines, it'd be nice to see the unique
path part go into PG13, where IIUC it can power the "useless unique
removal" patch.




Re: Index Skip Scan

2020-03-11 Thread David Rowley
On Wed, 11 Mar 2020 at 16:44, Andy Fan  wrote:
>>
>>
>> I think the UniqueKeys may need to be changed from using
>> EquivalenceClasses to use Exprs instead.
>
>
> When I try to understand why UniqueKeys needs EquivalenceClasses,
> see your comments here.  I feel that  FuncExpr can't be
> used to as a UniquePath even we can create unique index on f(a)
> and f->strict == true.  The reason is even we know a is not null,
>  f->strict = true.  it is still be possible that f(a) == null.  unique index
> allows more than 1 null values.  so shall we move further to use varattrno
> instead of Expr?  if so,  we can also use a list of Bitmapset to present multi
> unique path of a single RelOptInfo.

We do need some method to determine if NULL values are possible. At
the base relation level that can probably be done by checking NOT NULL
constraints and strict base quals. At higher levels, we can use strict
join quals as proofs.

As for bit a Bitmapset of varattnos, that would certainly work well at
the base relation level when there are no unique expression indexes,
but it's not so simple with join relations when the varattnos only
mean something when you know which base relation it comes from.  I'm
not saying that Lists of Exprs is ideal, but I think trying to
optimise some code that does not yet exist is premature.

There was some other talk in [1] on how we might make checking if a
List contains a given Node.  That could be advantageous in a few
places in the query planner, and it might be useful for this too.

[1] 
https://www.postgresql.org/message-id/CAKJS1f8v-fUG8YpaAGj309ZuALo3aEk7f6cqMHr_AVwz1fKXug%40mail.gmail.com




Re: Index Skip Scan

2020-03-11 Thread Andy Fan
On Tue, Mar 10, 2020 at 4:32 AM James Coleman  wrote:

> On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com>
> wrote:
> >
> > Assuming we'll implement it in a way that we do not know about what kind
> > of path type is that in create_distinct_path, then it can also work for
> > ProjectionPath or anything else (if UniqueKeys are present). But then
> > still EquivalenceMember are used only to figure out correct
> > distinctPrefixKeys and do not affect whether or not skipping is applied.
> > What do I miss?
>
>
> Part of the puzzle seems to me to this part of the response:
>
> > I think the UniqueKeys may need to be changed from using
> > EquivalenceClasses to use Exprs instead.
>
> But I can't say I'm being overly helpful by pointing that out, since I
> don't have my head in the code enough to understand how you'd
> accomplish that :)
>
>
There was a dedicated thread  [1]  where David explain his idea very
detailed,
and you can also check that messages around that message for the context.
hope it helps.

[1]
https://www.postgresql.org/message-id/CAApHDvq7i0%3DO97r4Y1pv68%2BtprVczKsXRsV28rM9H-rVPOfeNQ%40mail.gmail.com


Re: Index Skip Scan

2020-03-10 Thread Andy Fan
>
>
> I think the UniqueKeys may need to be changed from using
> EquivalenceClasses to use Exprs instead.
>

When I try to understand why UniqueKeys needs EquivalenceClasses,
see your comments here.  I feel that  FuncExpr can't be
used to as a UniquePath even we can create unique index on f(a)
and f->strict == true.  The reason is even we know a is not null,
 f->strict = true.  it is still be possible that f(a) == null.  unique index
allows more than 1 null values.  so shall we move further to use varattrno
instead of Expr?  if so,  we can also use a list of Bitmapset to present
multi
unique path of a single RelOptInfo.


Re: Index Skip Scan

2020-03-10 Thread David Rowley
On Wed, 11 Mar 2020 at 01:38, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote:
> > There's also some weird looking assumptions that an EquivalenceMember
> > can only be a Var in create_distinct_paths(). I think you're only
> > saved from crashing there because a ProjectionPath will be created
> > atop of the IndexPath to evaluate expressions, in which case you're
> > not seeing the IndexPath.  This results in the optimisation not
> > working in cases like:
>
> I've read it as "an assumption that an EquivalenceMember can only be a
> Var" results in "the optimisation not working in cases like this". But
> you've meant that ignoring a ProjectionPath with an IndexPath inside
> results in this optimisation not working, right? If so, then everything
> is clear, and my apologies, maybe I need to finally fix my sleep
> schedule :)

Yes, I was complaining that a ProjectionPath breaks the optimisation
and I don't believe there's any reason that it should.

I believe the way to make that work correctly requires paying
attention to the Path's uniquekeys rather than what type of path it
is.




Re: Index Skip Scan

2020-03-10 Thread Dmitry Dolgov
> >On Tue, Mar 10, 2020 at 09:29:32AM +1300, David Rowley wrote:
> On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Assuming we'll implement it in a way that we do not know about what kind
> > of path type is that in create_distinct_path, then it can also work for
> > ProjectionPath or anything else (if UniqueKeys are present). But then
> > still EquivalenceMember are used only to figure out correct
> > distinctPrefixKeys and do not affect whether or not skipping is applied.
> > What do I miss?
>
> I'm not sure I fully understand the question correctly, but let me
> explain further.
>
> In the 0001 patch, standard_qp_callback sets the query_uniquekeys
> depending on the DISTINCT / GROUP BY clause.  When building index
> paths in build_index_paths(), the 0002 patch should be looking at the
> root->query_uniquekeys to see if it can build any index paths that
> suit those keys.  Such paths should be tagged with the uniquekeys they
> satisfy, basically, exactly the same as how pathkeys work.  Many
> create_*_path functions will need to be modified to carry forward
> their uniquekeys. For example, create_projection_path(),
> create_limit_path() don't do anything which would cause the created
> path to violate the unique keys.   This way when you get down to
> create_distinct_paths(), paths other than IndexPath may have
> uniquekeys.  You'll be able to check which existing paths satisfy the
> unique keys required by the DISTINCT / GROUP BY and select those paths
> instead of having to create any HashAggregate / Unique paths.
>
> Does that answer the question?

Hmm... I'm afraid no, this was already clear. But looks like now I see
that I've misinterpreted one part.

> There's also some weird looking assumptions that an EquivalenceMember
> can only be a Var in create_distinct_paths(). I think you're only
> saved from crashing there because a ProjectionPath will be created
> atop of the IndexPath to evaluate expressions, in which case you're
> not seeing the IndexPath.  This results in the optimisation not
> working in cases like:

I've read it as "an assumption that an EquivalenceMember can only be a
Var" results in "the optimisation not working in cases like this". But
you've meant that ignoring a ProjectionPath with an IndexPath inside
results in this optimisation not working, right? If so, then everything
is clear, and my apologies, maybe I need to finally fix my sleep
schedule :)




Re: Index Skip Scan

2020-03-09 Thread James Coleman
On Mon, Mar 9, 2020 at 3:56 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> Assuming we'll implement it in a way that we do not know about what kind
> of path type is that in create_distinct_path, then it can also work for
> ProjectionPath or anything else (if UniqueKeys are present). But then
> still EquivalenceMember are used only to figure out correct
> distinctPrefixKeys and do not affect whether or not skipping is applied.
> What do I miss?


Part of the puzzle seems to me to this part of the response:

> I think the UniqueKeys may need to be changed from using
> EquivalenceClasses to use Exprs instead.

But I can't say I'm being overly helpful by pointing that out, since I
don't have my head in the code enough to understand how you'd
accomplish that :)

James




Re: Index Skip Scan

2020-03-09 Thread David Rowley
On Tue, 10 Mar 2020 at 08:56, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Assuming we'll implement it in a way that we do not know about what kind
> of path type is that in create_distinct_path, then it can also work for
> ProjectionPath or anything else (if UniqueKeys are present). But then
> still EquivalenceMember are used only to figure out correct
> distinctPrefixKeys and do not affect whether or not skipping is applied.
> What do I miss?

I'm not sure I fully understand the question correctly, but let me
explain further.

In the 0001 patch, standard_qp_callback sets the query_uniquekeys
depending on the DISTINCT / GROUP BY clause.  When building index
paths in build_index_paths(), the 0002 patch should be looking at the
root->query_uniquekeys to see if it can build any index paths that
suit those keys.  Such paths should be tagged with the uniquekeys they
satisfy, basically, exactly the same as how pathkeys work.  Many
create_*_path functions will need to be modified to carry forward
their uniquekeys. For example, create_projection_path(),
create_limit_path() don't do anything which would cause the created
path to violate the unique keys.   This way when you get down to
create_distinct_paths(), paths other than IndexPath may have
uniquekeys.  You'll be able to check which existing paths satisfy the
unique keys required by the DISTINCT / GROUP BY and select those paths
instead of having to create any HashAggregate / Unique paths.

Does that answer the question?




Re: Index Skip Scan

2020-03-09 Thread Dmitry Dolgov
> On Mon, Mar 09, 2020 at 10:27:26AM +1300, David Rowley wrote:
>
> > > I think the changes in create_distinct_paths() need more work.  The
> > > way I think this should work is that create_distinct_paths() gets to
> > > know exactly nothing about what path types support the elimination of
> > > duplicate values.  The Path should carry the UniqueKeys so that can be
> > > determined. In create_distinct_paths() you should just be able to make
> > > use of those paths, which should already have been created when
> > > creating index paths for the rel due to PlannerInfo's query_uniquekeys
> > > having been set.
> >
> > Just for me to clarify. The idea is to "move" information about what
> > path types support skipping into UniqueKeys (derived from PlannerInfo's
> > query_uniquekeys), but other checks (e.g. if index am supports that)
> > still perform in create_distinct_paths?
>
> create_distinct_paths() shouldn't know any details specific to the
> pathtype that it's using or considering using.  All the details should
> just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be
> no IsA(path, ...).  Please have a look over the details in my reply to
> Tomas. I hope that reply has enough information in it, but please
> reply there if I've missed something.

Yes, I've read this reply, just wanted to ask here, since I had other
questions as well. Speaking of which:

> > > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:
> > > There's also some weird looking assumptions that an EquivalenceMember
> > > can only be a Var in create_distinct_paths(). I think you're only
> > > saved from crashing there because a ProjectionPath will be created
> > > atop of the IndexPath to evaluate expressions, in which case you're
> > > not seeing the IndexPath.

I'm probably missing something, so to eliminate any misunderstanding
from my side:

> > > This results in the optimisation not working in cases like:
> > >
> > > postgres=# create table t (a int); create index on t ((a+1)); explain
> > > select distinct a+1 from t;
> > > CREATE TABLE
> > > CREATE INDEX
> > > QUERY PLAN
> > > ---
> > >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> > >Group Key: (a + 1)
> > >->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

In this particular example skipping is not applied because, as you've
mentioned, we're dealing with ProjectionPath (not IndexScan /
IndexOnlyScan). Which means we're not even reaching the code with
EquivalenceMember, so I'm still not sure how do they connected?

Assuming we'll implement it in a way that we do not know about what kind
of path type is that in create_distinct_path, then it can also work for
ProjectionPath or anything else (if UniqueKeys are present). But then
still EquivalenceMember are used only to figure out correct
distinctPrefixKeys and do not affect whether or not skipping is applied.
What do I miss?




Re: Index Skip Scan

2020-03-08 Thread David Rowley
On Mon, 9 Mar 2020 at 03:21, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> >
> > I've been looking over v32 of the patch and have a few comments
> > regarding the planner changes.
>
> Thanks for the commentaries!
>
> > I think the changes in create_distinct_paths() need more work.  The
> > way I think this should work is that create_distinct_paths() gets to
> > know exactly nothing about what path types support the elimination of
> > duplicate values.  The Path should carry the UniqueKeys so that can be
> > determined. In create_distinct_paths() you should just be able to make
> > use of those paths, which should already have been created when
> > creating index paths for the rel due to PlannerInfo's query_uniquekeys
> > having been set.
>
> Just for me to clarify. The idea is to "move" information about what
> path types support skipping into UniqueKeys (derived from PlannerInfo's
> query_uniquekeys), but other checks (e.g. if index am supports that)
> still perform in create_distinct_paths?

create_distinct_paths() shouldn't know any details specific to the
pathtype that it's using or considering using.  All the details should
just be in Path. e.g. uniquekeys, pathkeys, costs etc. There should be
no IsA(path, ...).  Please have a look over the details in my reply to
Tomas. I hope that reply has enough information in it, but please
reply there if I've missed something.

> > On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:
> > There's also some weird looking assumptions that an EquivalenceMember
> > can only be a Var in create_distinct_paths(). I think you're only
> > saved from crashing there because a ProjectionPath will be created
> > atop of the IndexPath to evaluate expressions, in which case you're
> > not seeing the IndexPath.  This results in the optimisation not
> > working in cases like:
> >
> > postgres=# create table t (a int); create index on t ((a+1)); explain
> > select distinct a+1 from t;
> > CREATE TABLE
> > CREATE INDEX
> > QUERY PLAN
> > ---
> >  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
> >Group Key: (a + 1)
> >->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)
>
> Yes, I need to fix it.
>
> > Using unique paths as I mentioned above should see that fixed.
>
> I'm a bit confused about this statement, how exactly unique paths should
> fix this?

The path's uniquekeys would mention that it's unique on  (a+1). You'd
compare the uniquekeys of the path to the DISTINCT clause and see that
the uniquekeys are a subset of the DISTINCT clause therefore the
DISTINCT is a no-op. If that uniquekey path is cheaper than the
cheapest_total_path + , then you should
pick the unique path, otherwise use the cheapest_total_path and
uniquify that.

I think the UniqueKeys may need to be changed from using
EquivalenceClasses to use Exprs instead.




Re: Index Skip Scan

2020-03-08 Thread Dmitry Dolgov
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:
>
> I've been looking over v32 of the patch and have a few comments
> regarding the planner changes.

Thanks for the commentaries!

> I think the changes in create_distinct_paths() need more work.  The
> way I think this should work is that create_distinct_paths() gets to
> know exactly nothing about what path types support the elimination of
> duplicate values.  The Path should carry the UniqueKeys so that can be
> determined. In create_distinct_paths() you should just be able to make
> use of those paths, which should already have been created when
> creating index paths for the rel due to PlannerInfo's query_uniquekeys
> having been set.

Just for me to clarify. The idea is to "move" information about what
path types support skipping into UniqueKeys (derived from PlannerInfo's
query_uniquekeys), but other checks (e.g. if index am supports that)
still perform in create_distinct_paths?

> Also, should create_grouping_paths() be getting the same code?
> Jesper's UniqueKey patch seems to set query_uniquekeys when there's a
> GROUP BY with no aggregates. So it looks like he has intended that
> something like:
>
> SELECT x FROM t GROUP BY x;
>
> should work the same way as
>
> SELECT DISTINCT x FROM t;
>
> but the 0002 patch does not make this work.  Has that just been overlooked?

I believe it wasn't overlooked in 0002 patch, but rather added just in
case in 0001. I guess there are no theoretical problems in implementing
it, but since we wanted to keep scope of the patch under control and
concentrate on the existing functionality it probably makes sense to
plan it as one of the next steps?

> There's also some weird looking assumptions that an EquivalenceMember
> can only be a Var in create_distinct_paths(). I think you're only
> saved from crashing there because a ProjectionPath will be created
> atop of the IndexPath to evaluate expressions, in which case you're
> not seeing the IndexPath.  This results in the optimisation not
> working in cases like:
>
> postgres=# create table t (a int); create index on t ((a+1)); explain
> select distinct a+1 from t;
> CREATE TABLE
> CREATE INDEX
> QUERY PLAN
> ---
>  HashAggregate  (cost=48.25..50.75 rows=200 width=4)
>Group Key: (a + 1)
>->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Yes, I need to fix it.

> Using unique paths as I mentioned above should see that fixed.

I'm a bit confused about this statement, how exactly unique paths should
fix this?




Re: Index Skip Scan

2020-03-07 Thread David Rowley
On Sat, 7 Mar 2020 at 03:49, Tomas Vondra  wrote:
>
> On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:
> >The reason it must be done this way is that when the RelOptInfo that
> >we're performing the DISTINCT on is a joinrel, then we're not going to
> >see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort
> >of Join path instead.  I understand you're not yet supporting doing
> >this optimisation when joins are involved, but it should be coded in
> >such a way that it'll work when we do.  (It's probably also a separate
> >question as to whether we should have this only work when there are no
> >joins. I don't think I personally object to it for stage 1, but
> >perhaps someone else might think differently.)
> >
>
> I don't follow. Can you elaborate more?
>
> AFAICS skip-scan is essentially a capability of an (index) AM. I don't
> see how we could ever do that for joinrels? We can do that at the scan
> level, below a join, but that's what this patch already supports, I
> think. When you say "supporting this optimisation" with joins, do you
> mean doing skip-scan for join inputs, or on top of the join?

The skip index scan Path would still be created at the base rel level,
but the join path on the join relation would have one of the sub-paths
of the join as an index skip scan.

An example query that could make use of this is:

SELECT * FROM some_table WHERE a IN(SELECT
indexed_col_with_few_distinct_values FROM big_table);

In this case, we might want to create a Skip Scan path on big_table
using the index on the "indexed_col_with_few_distinct_values", then
Hash Join to "some_table". That class of query is likely stage 2 or 3
of this work, but we need to lay foundations that'll support it.

As for not having IndexScan paths in joinrels. Yes, of course, but
that's exactly why create_distinct_paths() cannot work the way the
patch currently codes it. The patch does:

+ /*
+ * XXX: In case of index scan quals evaluation happens
+ * after ExecScanFetch, which means skip results could be
+ * fitered out. Consider the following query:
+ *
+ * select distinct (a, b) a, b, c from t where  c < 100;
+ *
+ * Skip scan returns one tuple for one distinct set of (a,
+ * b) with arbitrary one of c, so if the choosed c does
+ * not match the qual and there is any c that matches the
+ * qual, we miss that tuple.
+ */
+ if (path->pathtype == T_IndexScan &&

which will never work for join relations since they'll only have paths
for Loop/Merge/Hash type joins.  The key here is to determine which
skip scan paths we should create when we're building the normal index
paths then see if we can make use of those when planning joins.
Subsequently, we'll then see if we can make use of the resulting join
paths during create_distinct_paths(). Doing it this way will allow us
to use skip scans in queries such as:

SELECT DISTINCT t1.z FROM t1 INNER JOIN t2 ON t1.a = t2.unique_col;

We'll first create the skip scan paths on t1, then when creating the
join paths we'll create additional join paths which use the skipscan
path. Because t1.unique_col will at most have 1 join partner for each
t2 row, then the join path will have the same unique_keys as the
skipscan path.  That'll allow us to use the join path which has the
skip scan on whichever side of the join the t1 relation ends up.  All
create_distinct_paths() should be doing is looking for paths that are
already implicitly unique on the distinct clause and consider using
those in a cost-based way. It shouldn't be making such paths itself.

> >For storing these new paths with UniqueKeys, I'm not sure exactly if
> >we can just add_path() such paths into the RelOptInfo's pathlist.
> >What we don't want to do is accidentally make use of paths which
> >eliminate duplicate values when we don't want that behaviour.  If we
> >did store these paths in RelOptInfo->pathlist then we'd need to go and
> >modify a bunch of places to ignore such paths. set_cheapest() would
> >have to do something special for them too, which makes me think
> >pathlist is the incorrect place.  Parallel query added
> >partial_pathlist, so perhaps we need unique_pathlist to make this
> >work.
> >
>
> Hmmm, good point. Do we actually produce incorrect plans with the
> current patch, using skip-scan path when we should not?

I don't think so. The patch is only creating skip scan paths on the
base rel when we discover it's valid to do so.  That's not the way it
should work though.  How the patch currently works would be similar to
initially only creating a SeqScan path for a query such as: SELECT *
FROM tab ORDER BY a;, but then, during  create_ordered_paths() go and
create some IndexPath to scan the btree index on tab.a because we
suddenly realise that it'll be good to use that for the

Re: Index Skip Scan

2020-03-06 Thread Tomas Vondra

On Wed, Mar 04, 2020 at 11:32:00AM +1300, David Rowley wrote:

On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote:

Here is something similar to what I had in mind.


(changing to this email address for future emails)

Hi,

I've been looking over v32 of the patch and have a few comments
regarding the planner changes.

I think the changes in create_distinct_paths() need more work.  The
way I think this should work is that create_distinct_paths() gets to
know exactly nothing about what path types support the elimination of
duplicate values.  The Path should carry the UniqueKeys so that can be
determined. In create_distinct_paths() you should just be able to make
use of those paths, which should already have been created when
creating index paths for the rel due to PlannerInfo's query_uniquekeys
having been set.



+1 to code this in a generic way, using query_uniquekeys (if possible)


The reason it must be done this way is that when the RelOptInfo that
we're performing the DISTINCT on is a joinrel, then we're not going to
see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort
of Join path instead.  I understand you're not yet supporting doing
this optimisation when joins are involved, but it should be coded in
such a way that it'll work when we do.  (It's probably also a separate
question as to whether we should have this only work when there are no
joins. I don't think I personally object to it for stage 1, but
perhaps someone else might think differently.)



I don't follow. Can you elaborate more?

AFAICS skip-scan is essentially a capability of an (index) AM. I don't
see how we could ever do that for joinrels? We can do that at the scan
level, below a join, but that's what this patch already supports, I
think. When you say "supporting this optimisation" with joins, do you
mean doing skip-scan for join inputs, or on top of the join?


For storing these new paths with UniqueKeys, I'm not sure exactly if
we can just add_path() such paths into the RelOptInfo's pathlist.
What we don't want to do is accidentally make use of paths which
eliminate duplicate values when we don't want that behaviour.  If we
did store these paths in RelOptInfo->pathlist then we'd need to go and
modify a bunch of places to ignore such paths. set_cheapest() would
have to do something special for them too, which makes me think
pathlist is the incorrect place.  Parallel query added
partial_pathlist, so perhaps we need unique_pathlist to make this
work.



Hmmm, good point. Do we actually produce incorrect plans with the
current patch, using skip-scan path when we should not?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Index Skip Scan

2020-03-03 Thread David Rowley
On Tue, 18 Feb 2020 at 05:24, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Here is something similar to what I had in mind.

(changing to this email address for future emails)

Hi,

I've been looking over v32 of the patch and have a few comments
regarding the planner changes.

I think the changes in create_distinct_paths() need more work.  The
way I think this should work is that create_distinct_paths() gets to
know exactly nothing about what path types support the elimination of
duplicate values.  The Path should carry the UniqueKeys so that can be
determined. In create_distinct_paths() you should just be able to make
use of those paths, which should already have been created when
creating index paths for the rel due to PlannerInfo's query_uniquekeys
having been set.

The reason it must be done this way is that when the RelOptInfo that
we're performing the DISTINCT on is a joinrel, then we're not going to
see any IndexPaths in the RelOptInfo's pathlist. We'll have some sort
of Join path instead.  I understand you're not yet supporting doing
this optimisation when joins are involved, but it should be coded in
such a way that it'll work when we do.  (It's probably also a separate
question as to whether we should have this only work when there are no
joins. I don't think I personally object to it for stage 1, but
perhaps someone else might think differently.)

For storing these new paths with UniqueKeys, I'm not sure exactly if
we can just add_path() such paths into the RelOptInfo's pathlist.
What we don't want to do is accidentally make use of paths which
eliminate duplicate values when we don't want that behaviour.  If we
did store these paths in RelOptInfo->pathlist then we'd need to go and
modify a bunch of places to ignore such paths. set_cheapest() would
have to do something special for them too, which makes me think
pathlist is the incorrect place.  Parallel query added
partial_pathlist, so perhaps we need unique_pathlist to make this
work.

Also, should create_grouping_paths() be getting the same code?
Jesper's UniqueKey patch seems to set query_uniquekeys when there's a
GROUP BY with no aggregates. So it looks like he has intended that
something like:

SELECT x FROM t GROUP BY x;

should work the same way as

SELECT DISTINCT x FROM t;

but the 0002 patch does not make this work.  Has that just been overlooked?

There's also some weird looking assumptions that an EquivalenceMember
can only be a Var in create_distinct_paths(). I think you're only
saved from crashing there because a ProjectionPath will be created
atop of the IndexPath to evaluate expressions, in which case you're
not seeing the IndexPath.  This results in the optimisation not
working in cases like:

postgres=# create table t (a int); create index on t ((a+1)); explain
select distinct a+1 from t;
CREATE TABLE
CREATE INDEX
QUERY PLAN
---
 HashAggregate  (cost=48.25..50.75 rows=200 width=4)
   Group Key: (a + 1)
   ->  Seq Scan on t  (cost=0.00..41.88 rows=2550 width=4)

Using unique paths as I mentioned above should see that fixed.

David




Re: Index Skip Scan

2020-02-17 Thread Dmitry Dolgov
*distinct_pathkeys;	/* distinctClause pathkeys, if any */
@@ -1077,6 +1081,15 @@ typedef struct ParamPathInfo
 	List	   *ppi_clauses;	/* join clauses available from outer rels */
 } ParamPathInfo;
 
+/*
+ * UniqueKey
+ */
+typedef struct UniqueKey
+{
+	NodeTag		type;
+
+	EquivalenceClass *eq_clause;	/* equivalence class */
+} UniqueKey;
 
 /*
  * Type "Path" is used as-is for sequential-scan paths, as well as some other
@@ -1106,6 +1119,9 @@ typedef struct ParamPathInfo
  *
  * "pathkeys" is a List of PathKey nodes (see above), describing the sort
  * ordering of the path's output rows.
+ *
+ * "uniquekeys", if not NIL, is a list of UniqueKey nodes (see above),
+ * describing the XXX.
  */
 typedef struct Path
 {
@@ -1129,6 +1145,8 @@ typedef struct Path
 
 	List	   *pathkeys;		/* sort ordering of path's output */
 	/* pathkeys is a List of PathKey nodes; see above */
+
+	List	   *uniquekeys;	/* the unique keys, or NIL if none */
 } Path;
 
 /* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/print.h b/src/include/nodes/print.h
index 6126b491bf..006248bfb5 100644
--- a/src/include/nodes/print.h
+++ b/src/include/nodes/print.h
@@ -28,6 +28,7 @@ extern char *pretty_format_node_dump(const char *dump);
 extern void print_rt(const List *rtable);
 extern void print_expr(const Node *expr, const List *rtable);
 extern void print_pathkeys(const List *pathkeys, const List *rtable);
+extern void print_uniquekeys(const List *uniquekeys, const List *rtable);
 extern void print_tl(const List *tlist, const List *rtable);
 extern void print_slot(TupleTableSlot *slot);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..f75ff6f323 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -44,6 +44,7 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 	List *indexorderbys,
 	List *indexorderbycols,
 	List *pathkeys,
+	List *uniquekeys,
 	ScanDirection indexscandir,
 	bool indexonly,
 	Relids required_outer,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9ab73bd20c..5b6be383b3 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -214,6 +214,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
 extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
 		   List *sortclauses,
 		   List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+		  List *sortclauses,
+		  List *tlist);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 			RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -240,4 +243,12 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	List *live_childrels);
 
+/*
+ * uniquekey.c
+ *	  Utilities for matching and building unique keys
+ */
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+extern bool uniquekeys_contained_in(List *keys1, List *keys2);
+extern bool has_useful_uniquekeys(PlannerInfo *root);
+
 #endif			/* PATHS_H */
-- 
2.21.0

>From 1345f11ca221219a57a98552773963b4c3ceeac0 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH v32 2/2] Index skip scan

Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
on top of IndexOnlyScan and IndexScan. To make it suitable for both
situations when there are small number of distinct values and
significant amount of distinct values the following approach is taken -
instead of searching from the root for every value we're searching for
then first on the current page, and then if not found continue searching
from the root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  15 +
 doc/src/sgml/indexam.sgml |  63 ++
 doc/src/sgml/indices.sgml |  23 +
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  18 +
 src/backend/access/nbtree/nbtree.c|  13 +
 src/backend/access/nbtree/nbtsearch.c | 466 +-
 src/

Re: Index Skip Scan

2020-02-14 Thread Dmitry Dolgov
> On Fri, Feb 14, 2020 at 05:23:13PM +0900, Kyotaro Horiguchi wrote:
> The first attached (renamed to .txt not to confuse the cfbots) is a
> small patch that makes sure if _bt_readpage is called with the proper
> condition as written in its comment, that is, caller must have pinned
> and read-locked so->currPos.buf. This patch reveals many instances of
> breakage of the contract.

Thanks! On top of which patch version one can apply it? I'm asking
because I believe I've addressed similar issues in the last version, and
the last proposed diff (after resolving some conflicts) breaks tests for
me, so not sure if I miss something.

At the same time if you and Tomas strongly agree that it actually makes
sense to make moving forward/reading backward case work with dead tuples
correctly, I'll take a shot and try to teach the code around _bt_skip to
do what is required for that. I can merge your changes there and we can
see what would be the result.




  1   2   3   >