Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Alexander Korotkov wrote:

On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt 


Ah, yes, exactly the same problem. So what led you to add a flag instead
of using the range NULL..NULL? I'm on the fence about choosing.


At first, range bounds can't be NULL :) At second, if we have range
(a;b)+"contain empty" in internal page, both facts:
1) All normal underlying ranges are contained in (a;b).
2) There can be empty underlying ranges.
are useful for search.


That makes sense; you're essentially keeping one bit of stats about the 
values present in the range.


I wonder: if I'm indexing a rowtype, then for each column in the row I need 
to store a lower-left and an upper-right bound, plus a might-have-nulls 
flag.  Sounds a lot like a range. Should I just use ranges for that? See a 
downside (overhead)? See an upside (seems less duplicative somehow)? I'm 
fine depending on 9.2.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Alexander Korotkov
On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt  wrote:

> Alexander Korotkov wrote:
>
>> On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt > > wrote:
>>
>> At first I thought this posed a challenge for union; if I have these
> points:
>
>>
>>(1,2)
>>(2,1)
>>(1,NULL)
>>
>>what's the union? I think the answer is to treat NULL box coordinates
>>like LL = -infinity, UR = infinity, or (equivalently, I think) to store
>>a saw_nulls bit in addition to LL and UR.
>>
>> Similar problem appears at GiST indexing of ranges, because range can be
>> empty. There additional "contain empty" flag was introduced. This "contain
>> empty" flag indicates that underlying value can be empty. So, this flag is
>> set when union with empty range or other range with this flag set. It's
>> likely you need similar flag for each dimension.
>>
>
> Ah, yes, exactly the same problem. So what led you to add a flag instead
> of using the range NULL..NULL? I'm on the fence about choosing.


At first, range bounds can't be NULL :) At second, if we have range
(a;b)+"contain empty" in internal page, both facts:
1) All normal underlying ranges are contained in (a;b).
2) There can be empty underlying ranges.
are useful for search.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Alexander Korotkov wrote:

On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt mailto:jay.lev...@gmail.com>> wrote:

At first I thought this posed a challenge for union; if I have these 
points:


(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates
like LL = -infinity, UR = infinity, or (equivalently, I think) to store
a saw_nulls bit in addition to LL and UR.

Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional "contain empty" flag was introduced. This "contain
empty" flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.


Ah, yes, exactly the same problem. So what led you to add a flag instead of 
using the range NULL..NULL? I'm on the fence about choosing.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Alexander Korotkov
On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt  wrote:

> Tom Lane wrote:
>
>> Jay Levitt  writes:
>>
>>> - Does KNN-GiST run into problems when<->  returns values that don't
>>> "make
>>>
>>> sense" in the physical world?
>>>
>>
>> If the indexed entities are records, it would be
>> entirely your own business how you handled individual fields being NULL.
>>
>
> This turns out to be a bit challenging. Let's say I'm building a
> nullable_point type that allows the Y axis to be NULL (or any sentinel
> value for "missing data"), where the semantics are "NULL is infinitely far
> from the query".   I'll need my GiST functions to return useful results
> with NULL - not just correct results, but results that help partition the
> tree nicely.
>
> At first I thought this posed a challenge for union; if I have these
> points:
>
> (1,2)
> (2,1)
> (1,NULL)
>
> what's the union? I think the answer is to treat NULL box coordinates like
> LL = -infinity, UR = infinity, or (equivalently, I think) to store a
> saw_nulls bit in addition to LL and UR.
>
> The real challenge is probably in picksplit and penalty - where in the
> tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for
> efficient indexes are hard work and computer-science-y" and point me at
> surrogate splitters.
>
> Just thinking out loud, I guess; if other GiST types have addressed this
> problem, I'd love to hear about it.


Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional "contain empty" flag was introduced. This "contain
empty" flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Tom Lane wrote:

Jay Levitt  writes:

- Does KNN-GiST run into problems when<->  returns values that don't "make
sense" in the physical world?


If the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.


This turns out to be a bit challenging. Let's say I'm building a 
nullable_point type that allows the Y axis to be NULL (or any sentinel value 
for "missing data"), where the semantics are "NULL is infinitely far from 
the query".   I'll need my GiST functions to return useful results with NULL 
- not just correct results, but results that help partition the tree nicely.


At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like 
LL = -infinity, UR = infinity, or (equivalently, I think) to store a 
saw_nulls bit in addition to LL and UR.


The real challenge is probably in picksplit and penalty - where in the tree 
should I stick (1,NULL)? - at which point you say "Yes, algorithms for 
efficient indexes are hard work and computer-science-y" and point me at 
surrogate splitters.


Just thinking out loud, I guess; if other GiST types have addressed this 
problem, I'd love to hear about it.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Tom Lane
Jay Levitt  writes:
> Tom Lane wrote:
>>> - Can domains have operators, or are operators defined on types?
>> 
>> I think the current state of play is that you can have such things but
>> the system will only consider them for exact type matches, so you might
>> need more explicit casts than you ordinarily would.

> Turns out it's even smarter than that; it seems to coerce when it's 
> unambiguous:

Yeah, that sort of case will probably work all right.  The thing to keep
in mind is that operators/functions declared to take domains are
definitely second-class citizens, and will lose out in many
somewhat-ambiguous cases where an operator on a base type could get
selected due to the ambiguity resolution rules.  For your application it
will probably help if you can pick an operator name that's not in use
for any operation on the base type(s).  Also, it's conceivable that it
won't matter much to you, since I gather these operators will mostly get
invoked "behind the scenes" and not directly written in queries; it
should not be too hard to avoid ambiguity in mechanically-generated
references.

(There was a good deal of chatter some years ago about trying to improve
support of functions taking domains, but nothing's gotten done yet.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Tom Lane wrote:

- Can domains have operators, or are operators defined on types?


I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.


Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

create domain birthdate as date;
create function date_dist(birthdate, birthdate) returns integer as $$
select 123;
$$ language sql;
create operator <-> (
procedure = date_dist,
leftarg = birthdate,
rightarg = birthdate);

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- 123


create domain activity_date as date;
create function date_dist(activity_date, activity_date)
returns integer as $$
  select 432;
$$ language sql;
create operator <-> (
procedure = date_dist,
leftarg = activity_date,
rightarg = activity_date);

select '2012-01-01'::activity_date <-> '2012-01-01'::activity_date;
-- 432

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- ERROR:  operator is not unique: date <-> date

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Tom Lane
Jay Levitt  writes:
> Perfect. Composite types are exactly what I need here; the application can 
> declare its composite type and provide distance functions for each member, 
> and the extension can use those to calculate similarity. How do I introspect 
> the composite type's pg_class to see what it contains? I assume there's a 
> better way than SPI on system catalogs :)

Definitely.  Take a look at record_out, record_cmp, and sibling
functions on generic composites (src/backend/utils/adt/rowtypes.c).
You might or might not feel like wiring in more assumptions than those
have about the possible contents of the record type.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Tom Lane wrote:

Jay Levitt  writes:

- I'm not sure how to represent arbitrary column-like features without
reinventing the wheel and putting a database in the database.


ISTM you could define a composite type and then create operators and an
operator class over that type.  If you were trying to make a btree
opclass there might be a conflict with the built-in record_ops opclass,
but since you're only interested in GIST I don't see any real
roadblocks.


Perfect. Composite types are exactly what I need here; the application can 
declare its composite type and provide distance functions for each member, 
and the extension can use those to calculate similarity. How do I introspect 
the composite type's pg_class to see what it contains? I assume there's a 
better way than SPI on system catalogs :) Should I be using systable_* 
functions from genam, or is there an in-memory tree? I feel like funcapi 
gets me partway there but there's magic in the middle.


Can you think of any code that would serve as a sample, maybe whatever 
creates the output for psql's \d?




The main potential disadvantage of this is that you'd have
the standard tuple header as overhead in index entries --- but maybe the
entries are large enough that that doesn't matter, and in any case you
could probably make use of the GIST "compress" method to get rid of most
of the header.  Maybe convert to MinimalTuple, for instance, if you want
to still be able to leverage existing support code for field extraction.


Probably not worth it to save the 8 bytes; we're starting out at about 20 
floats per row. But good to know for later optimization...





- Can domains have operators, or are operators defined on types?


I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.  However, we only
support domains over base types not composites, so this isn't really
going to be a profitable direction for you anyway.


Actually, as mentioned to Alexander, I'm thinking of domains per feature, 
not for the overall tuple, so birthdate<->birthdate differs from 
now()<->posting_date.  Sounds like that might work - I'll play.



- Does KNN-GiST run into problems when<->  returns values that don't "make
sense" in the physical world?


Wouldn't surprise me.  In general, non-strict index operators are a bad
idea.  However, if the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.


Yeah, that example conflated NULLs in the feature fields (we don't know your 
birthdate) with <-> on the whole tuple.  Oops.


I guess I can just test this by verifying that KNN-GiST ordered by distance 
returns the same results as without the index.


Thanks for your help here.

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Alexander Korotkov wrote:

On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt mailto:jay.lev...@gmail.com>> wrote:

- But a dimension might be in any domain, not just floats
- The distance along each dimension is a domain-specific function

What exact domains do you expect? Some domains could appear to be quite hard
for index-based similarity search using GiST (for example, sets, strings etc.).


Oh, nothing nearly so complex, and (to Tom's point) no composite types, 
either. Right now we have demographics like gender, geolocation, and 
birthdate; I think any domain will be a type that's easily expressible in 
linear terms.  I was thinking in domains rather than types because there 
isn't one distance function for "date" or "float"; me.birthdate <-> 
you.birthdate "birthdate" is normalized to a different curve than now() <-> 
posting_date, and raw_score <-> raw_score would differ from z_score <-> z_score.


It would have been elegant to express that distance with <->, but since 
domains can't have operators, I can create distance(this, other) functions 
for each domain. It just won't look as pretty!


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Alexander Korotkov
On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt  wrote:

> - But a dimension might be in any domain, not just floats
> - The distance along each dimension is a domain-specific function
>

What exact domains do you expect? Some domains could appear to be quite
hard for index-based similarity search using GiST (for example, sets,
strings etc.).

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-15 Thread Tom Lane
Jay Levitt  writes:
> - I'm not sure how to represent arbitrary column-like features without 
> reinventing the wheel and putting a database in the database.

ISTM you could define a composite type and then create operators and an
operator class over that type.  If you were trying to make a btree
opclass there might be a conflict with the built-in record_ops opclass,
but since you're only interested in GIST I don't see any real
roadblocks.  The main potential disadvantage of this is that you'd have
the standard tuple header as overhead in index entries --- but maybe the
entries are large enough that that doesn't matter, and in any case you
could probably make use of the GIST "compress" method to get rid of most
of the header.  Maybe convert to MinimalTuple, for instance, if you want
to still be able to leverage existing support code for field extraction.

> - Can domains have operators, or are operators defined on types?

I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.  However, we only
support domains over base types not composites, so this isn't really
going to be a profitable direction for you anyway.

> - Does KNN-GiST run into problems when <-> returns values that don't "make 
> sense" in the physical world?

Wouldn't surprise me.  In general, non-strict index operators are a bad
idea.  However, if the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers