Re: [HACKERS] [PATCH] kNN for btree

2017-03-09 Thread Alexander Korotkov
On Thu, Mar 2, 2017 at 5:57 PM, David Steele  wrote:

> Hi Alexander,
>
> On 2/16/17 11:20 AM, Robert Haas wrote:
> > On Thu, Feb 16, 2017 at 10:59 AM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
> >>>  wrote:
>  My idea is that we need more general redesign of specifying ordering
> which
>  index can produce.  Ideally, we should replace amcanorder,
> amcanbackward and
>  amcanorderbyop with single callback.  Such callback should take a
> list of
>  pathkeys and return number of leading pathkeys index could satisfy
> (with
>  corresponding information for index scan).  I'm not sure that other
> hackers
>  would agree with such design, but I'm very convinced that we need
> something
>  of this level of extendability.  Otherwise we would have to hack our
> planner
>  <-> index_access_method interface each time we decide to cover
> another index
>  produced ordering.
> >>
> >>> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
> >>> like we need something.
> >>
> >> That's definitely not exactly the right idea, because using it would
> >> require the core planner to play twenty-questions trying to guess which
> >> pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
> >> pathkey list?  How about that one?")  It could be sensible to have a
> >> callback that's called once per index and hands back a list of pathkey
> >> lists that represent interesting orders the index could produce, which
> >> could be informed by looking aside at the PlannerInfo contents to see
> >> what is likely to be relevant to the query.
> >>
> >> But even so, I'm not convinced that that is a better design or more
> >> maintainable than the current approach.  I fear that it will lead to
> >> duplicating substantial amounts of code and knowledge into each index
> AM,
> >> which is not an improvement; and if anything, that increases the risk of
> >> breaking every index AM anytime you want to introduce some fundamentally
> >> new capability in the area.  Now that it's actually practical to have
> >> out-of-core index AMs, that's a bigger concern than it might once have
> >> been.
> >
> > Yeah, that's all true.  But I think Alexander is right that just
> > adding amcandoblah flags ad infinitum doesn't feel good either.  The
> > interface isn't really arm's-length if every new thing somebody wants
> > to do something new requires another flag.
> >
> >> Also see the discussion that led up to commit ed0097e4f.  Users objected
> >> the last time we tried to make index capabilities opaque at the SQL
> level,
> >> so they're not going to like a design that tries to hide that
> information
> >> even from the core C code.
> >
> > Discoverability is definitely important, but first we have to figure
> > out how we're going to make it work, and then we can work out how to
> > let users see how it works.
>
> Reading through this thread I'm concerned that this appears to be a big
> change making its first appearance in the last CF.  There is also the
> need for a new patch and a general consensus of how to proceed.
>

Yes, refactoring of amcanorder/amcanorderbyop should be very thoughtful.

I recommend moving this patch to 2017-07 or marking it RWF.
>

I agree. Done.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] kNN for btree

2017-03-02 Thread David Steele
Hi Alexander,

On 2/16/17 11:20 AM, Robert Haas wrote:
> On Thu, Feb 16, 2017 at 10:59 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
>>>  wrote:
 My idea is that we need more general redesign of specifying ordering which
 index can produce.  Ideally, we should replace amcanorder, amcanbackward 
 and
 amcanorderbyop with single callback.  Such callback should take a list of
 pathkeys and return number of leading pathkeys index could satisfy (with
 corresponding information for index scan).  I'm not sure that other hackers
 would agree with such design, but I'm very convinced that we need something
 of this level of extendability.  Otherwise we would have to hack our 
 planner
 <-> index_access_method interface each time we decide to cover another 
 index
 produced ordering.
>>
>>> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
>>> like we need something.
>>
>> That's definitely not exactly the right idea, because using it would
>> require the core planner to play twenty-questions trying to guess which
>> pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
>> pathkey list?  How about that one?")  It could be sensible to have a
>> callback that's called once per index and hands back a list of pathkey
>> lists that represent interesting orders the index could produce, which
>> could be informed by looking aside at the PlannerInfo contents to see
>> what is likely to be relevant to the query.
>>
>> But even so, I'm not convinced that that is a better design or more
>> maintainable than the current approach.  I fear that it will lead to
>> duplicating substantial amounts of code and knowledge into each index AM,
>> which is not an improvement; and if anything, that increases the risk of
>> breaking every index AM anytime you want to introduce some fundamentally
>> new capability in the area.  Now that it's actually practical to have
>> out-of-core index AMs, that's a bigger concern than it might once have
>> been.
> 
> Yeah, that's all true.  But I think Alexander is right that just
> adding amcandoblah flags ad infinitum doesn't feel good either.  The
> interface isn't really arm's-length if every new thing somebody wants
> to do something new requires another flag.
> 
>> Also see the discussion that led up to commit ed0097e4f.  Users objected
>> the last time we tried to make index capabilities opaque at the SQL level,
>> so they're not going to like a design that tries to hide that information
>> even from the core C code.
> 
> Discoverability is definitely important, but first we have to figure
> out how we're going to make it work, and then we can work out how to
> let users see how it works.

Reading through this thread I'm concerned that this appears to be a big
change making its first appearance in the last CF.  There is also the
need for a new patch and a general consensus of how to proceed.

I recommend moving this patch to 2017-07 or marking it RWF.

Thanks,
-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 10:59 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
>>  wrote:
>>> My idea is that we need more general redesign of specifying ordering which
>>> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
>>> amcanorderbyop with single callback.  Such callback should take a list of
>>> pathkeys and return number of leading pathkeys index could satisfy (with
>>> corresponding information for index scan).  I'm not sure that other hackers
>>> would agree with such design, but I'm very convinced that we need something
>>> of this level of extendability.  Otherwise we would have to hack our planner
>>> <-> index_access_method interface each time we decide to cover another index
>>> produced ordering.
>
>> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
>> like we need something.
>
> That's definitely not exactly the right idea, because using it would
> require the core planner to play twenty-questions trying to guess which
> pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
> pathkey list?  How about that one?")  It could be sensible to have a
> callback that's called once per index and hands back a list of pathkey
> lists that represent interesting orders the index could produce, which
> could be informed by looking aside at the PlannerInfo contents to see
> what is likely to be relevant to the query.
>
> But even so, I'm not convinced that that is a better design or more
> maintainable than the current approach.  I fear that it will lead to
> duplicating substantial amounts of code and knowledge into each index AM,
> which is not an improvement; and if anything, that increases the risk of
> breaking every index AM anytime you want to introduce some fundamentally
> new capability in the area.  Now that it's actually practical to have
> out-of-core index AMs, that's a bigger concern than it might once have
> been.

Yeah, that's all true.  But I think Alexander is right that just
adding amcandoblah flags ad infinitum doesn't feel good either.  The
interface isn't really arm's-length if every new thing somebody wants
to do something new requires another flag.

> Also see the discussion that led up to commit ed0097e4f.  Users objected
> the last time we tried to make index capabilities opaque at the SQL level,
> so they're not going to like a design that tries to hide that information
> even from the core C code.

Discoverability is definitely important, but first we have to figure
out how we're going to make it work, and then we can work out how to
let users see how it works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Tom Lane
Robert Haas  writes:
> On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
>  wrote:
>> My idea is that we need more general redesign of specifying ordering which
>> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
>> amcanorderbyop with single callback.  Such callback should take a list of
>> pathkeys and return number of leading pathkeys index could satisfy (with
>> corresponding information for index scan).  I'm not sure that other hackers
>> would agree with such design, but I'm very convinced that we need something
>> of this level of extendability.  Otherwise we would have to hack our planner
>> <-> index_access_method interface each time we decide to cover another index
>> produced ordering.

> Yeah.  I'm not sure if that's exactly the right idea.  But it seems
> like we need something.

That's definitely not exactly the right idea, because using it would
require the core planner to play twenty-questions trying to guess which
pathkeys the index can satisfy.  ("Can you satisfy some prefix of this
pathkey list?  How about that one?")  It could be sensible to have a
callback that's called once per index and hands back a list of pathkey
lists that represent interesting orders the index could produce, which
could be informed by looking aside at the PlannerInfo contents to see
what is likely to be relevant to the query.

But even so, I'm not convinced that that is a better design or more
maintainable than the current approach.  I fear that it will lead to
duplicating substantial amounts of code and knowledge into each index AM,
which is not an improvement; and if anything, that increases the risk of
breaking every index AM anytime you want to introduce some fundamentally
new capability in the area.  Now that it's actually practical to have
out-of-core index AMs, that's a bigger concern than it might once have
been.

Also see the discussion that led up to commit ed0097e4f.  Users objected
the last time we tried to make index capabilities opaque at the SQL level,
so they're not going to like a design that tries to hide that information
even from the core C code.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 8:05 AM, Alexander Korotkov
 wrote:
> My idea is that we need more general redesign of specifying ordering which
> index can produce.  Ideally, we should replace amcanorder, amcanbackward and
> amcanorderbyop with single callback.  Such callback should take a list of
> pathkeys and return number of leading pathkeys index could satisfy (with
> corresponding information for index scan).  I'm not sure that other hackers
> would agree with such design, but I'm very convinced that we need something
> of this level of extendability.  Otherwise we would have to hack our planner
> <-> index_access_method interface each time we decide to cover another index
> produced ordering.

Yeah.  I'm not sure if that's exactly the right idea.  But it seems
like we need something.

>> info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;
>
> It's not necessary to use cast here.  For instance, we don't use cast for
> amcostestimate.

In fact, it's bad to use the cast here, because if in future the
signature of one of amroutine->amcanorderbyop is changed and
info->amcanorderbyop is not changed to match, then the cast will
prevent a compiler warning, but at runtime you may crash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-02-16 Thread Alexander Korotkov
Hi, Nikita!

I have assigned as a reviewer for this patchset.  I took a fist look to
these patches.
At first, I'd like to notice that it's very cool that you picked up this
work.  I frequently hear people complains about lack of this feature.

 k  |  kNN-btree   |  kNN-GiST|  Opt. query   |  Seq. scan
> |  | (btree_gist) |  with UNION   |  with sort
> |--|--|---|
>   1 |  0.041 4 |  0.079 4 |   0.060 8 |  41.1 1824
>  10 |  0.048 7 |  0.091 9 |   0.09717 |  41.8 1824
> 100 |  0.10747 |  0.19252 |   0.342   104 |  42.3 1824
>1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824
>   1 |  5.070  5622 |  6.240  6760 |  36.300 11031 |  54.1 1824
>  10 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824


These results looks quite expected.  KNN-btree uses about half of blocks in
comparison with UNION query, and it's more than twice faster.  In
comparison with kNN-GiST there is still some win.

1. Introduce amcanorderbyop() function
>
> This patch transforms existing boolean AM property amcanorderbyop into a
> method
> (function pointer).  This is necessary because, unlike GiST, kNN for btree
> supports only a one ordering operator on the first index column and we
> need a
> different pathkey matching logic for btree (there was a corresponding
> comment
> in match_pathkeys_to_index()).  GiST-specific logic has been moved from
> match_pathkeys_to_index() to gistcanorderbyop().


I'm not very excited about this design of amcanorderbyop callback.
Introducing new callback from index access method to the planner should
imply quite good flexibility to the future.  In this particular signature
of callback I see no potential future use-cases than your implementation
for btree.  We could just add amcanorderbyonlyfisrtop property and that
would give us same level of flexibility I think.
With existing index types, we could cover much more orderings that we
currently do.  Some of possible cases:
1) "ORDER BY col" for btree_gist, SP-GiST text_ops
2) "ORDER BY col1, col2 <-> const" for btree_gist
3) "ORDER BY col1, col2 <-> const" for btree

I understand that #3 is quite hard task and I don't ask you to implement it
now.  But it would be nice if some day we decide to add #3, we wouldn't
have to change IndexAmRoutine definition.

My idea is that we need more general redesign of specifying ordering which
index can produce.  Ideally, we should replace amcanorder, amcanbackward
and amcanorderbyop with single callback.  Such callback should take a list
of pathkeys and return number of leading pathkeys index could satisfy (with
corresponding information for index scan).  I'm not sure that other hackers
would agree with such design, but I'm very convinced that we need something
of this level of extendability.  Otherwise we would have to hack our
planner <-> index_access_method interface each time we decide to cover
another index produced ordering.

6. Remove duplicate distance operators from contrib/btree_gist.
>
> References to their own distance operators in btree_gist opclasses are
> replaced with references to the built-in operators and than duplicate
> operators are dropped.  But if the user is using somewhere these operators,
> upgrade of btree_gist from 1.3 to 1.4 would fail.


The query in "btree_gist--1.3--1.4.sql" which directly touches system
catalogue to update opfamilies looks too hackery.  I think we shouldn't use
such queries until we have no other choice.
In this particular case we can update opfamilies using legal mechanism
"ALTER OPERATOR FAMILY name USING index_method ADD/DROP ... " (note that
operator name could be schema-qualified if needed).  This way wouldn't be
that brief, but it is much more correct.

Also this like catch my eyes.

> info->amcanorderbyop = (void (*)()) amroutine->amcanorderbyop;

It's not necessary to use cast here.  For instance, we don't use cast
for amcostestimate.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [PATCH] kNN for btree

2017-01-18 Thread Nikita Glukhov

Sorry for the broken formatting in my previous message.
Below is a corrected version of this message.


I'd like to present a series of patches that implements k-Nearest Neighbors
(kNN) search for btree, which can be used to speed up ORDER BY distance
queries like this:
SELECT * FROM events ORDER BY date <-> '2000-01-01'::date ASC LIMIT 100;

Now only GiST supports kNN, but kNN on btree can be emulated using
contrib/btree_gist.


Scanning algorithm
==

Algorithm is very simple: we use bidirectional B-tree index scan starting at
the point from which we measure the distance (target point).  At each step,
we advance this scan in the direction that has the  nearest point.  But when
the target point does not fall into the scanned range, we don't even need to
use a bidirectional scan here --- we can use ordinary unidirectional scan
in the right direction.


Performance results
===

Test database is taken from original kNN-GiST presentation (PGCon 2010).

Test query

SELECT * FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

can be optimized to the next rather complicated UNION form,
which no longer requires kNN:

WITH
t1 AS (SELECT * FROM events WHERE date >= '1957-10-04'::date
   ORDER BY date ASC  LIMIT k),
t2 AS (SELECT * FROM events WHERE date <  '1957-10-04'::date
   ORDER BY date DESC LIMIT k),
t  AS (SELECT * FROM t1 UNION SELECT * FROM t2)
SELECT * FROM t ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;


In each cell of this table shown query execution time in milliseconds and
the number of accessed blocks:


 k  |  kNN-btree   |  kNN-GiST|  Opt. query   |  Seq. scan
|  | (btree_gist) |  with UNION   |  with sort
|--|--|---|
  1 |  0.041 4 |  0.079 4 |   0.060 8 |  41.1 1824
 10 |  0.048 7 |  0.091 9 |   0.09717 |  41.8 1824
100 |  0.10747 |  0.19252 |   0.342   104 |  42.3 1824
   1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824
  1 |  5.070  5622 |  6.240  6760 |  36.300 11031 |  54.1 1824
 10 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824


As you can see, kNN-btree can be two times faster than kNN-GiST (btree_gist)
when k < 1000, but the number of blocks read is roughly the same.


Implementation details
==

A brief description is given below for each of the patches:

1. Introduce amcanorderbyop() function

This patch transforms existing boolean AM property amcanorderbyop into a method
(function pointer).  This is necessary because, unlike GiST, kNN for btree
supports only a one ordering operator on the first index column and we need a
different pathkey matching logic for btree (there was a corresponding comment
in match_pathkeys_to_index()).  GiST-specific logic has been moved from
match_pathkeys_to_index() to gistcanorderbyop().

2. Extract substructure BTScanState from BTScanOpaque

This refactoring is necessary for bidirectional kNN-scan implementation.
Now, BTScanOpaque's substructure BTScanState containing only the fields
related to scan position is passed to some functions where the whole
BTScanOpaque was passed previously.

3. Extract get_index_column_opclass(), get_opclass_opfamily_and_input_type().

Extracted two simple common functions used in gistproperty() and
btproperty() (see the next patch).

4. Add kNN support to btree

  * Added additional optional BTScanState to BTScanOpaque for
bidirectional kNN scan.
  * Implemented bidirectional kNN scan.
  * Implemented logic for selecting kNN strategy
  * Implemented btcanorderbyop(), updated btproperty() and btvalidate()

B-tree user interface functions have not been altered because ordering
operators are used directly.

5. Add distance operators for some types

These operators for integer, float, date, time, timestamp, interval, cash and
oid types have been copied from contrib/btree_gist and added to the existing
btree opclasses as ordering operators.  Their btree_gist duplicates are removed
in the next patch.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclasses are
replaced with references to the built-in operators and than duplicate
operators are dropped.  But if the user is using somewhere these operators,
upgrade of btree_gist from 1.3 to 1.4 would fail.

7. Add regression tests for btree kNN.

Tests were added only after the built-in distance operators were added.


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers