Re: [HACKERS] KNN searches support for SP-GiST [GSOC'14]

2014-10-27 Thread Thom Brown
On 20 August 2014 08:09, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:

 Hi there, pg-Hackers!
 Here I go with the patch which brings up the possibility to perform
 nearest-neighbour searches on SP-GiSTs (as of now includes implementation
 for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
 Sample benchmarking script included in the attachment (dumps the current
 geonames archive and runs several searches on the (latitude, longitude)
 points), which demonstrates the dramatic improvements against plain
 searches and sorting.


 Cool!

 I think this needs a high-level description in access/spgist/README on how
 this works. You can probably copy-paste the similar description from gist's
 README, because it's the same design. If I understand correctly, the
 support functions return distances between the nodes and the query, and the
 SP-GiST code uses those distances to return the rows in order. An important
 detail there is that the distance returned for an inner node is a lower
 bound of the distance of any node in that branch.

 A binary heap would be a better data structure than a red-black tree for
 the queue of nodes to visit/return. I know that GiST also uses a red-black
 for the same thing, but that's no reason not to do better here. There is a
 binary heap implementation in the backend too (see
 src/backend/lib/binaryheap.c), so it's not any more difficult to use.

 Does the code handle ties correctly? The distance function is just an
 approximation, so it's possible that there are two items with same
 distance, but are not equal according to the ordering operator. As an
 extreme example, if you hack the distance function to always return 0.0, do
 you still get correct results (albeit slowly)?

 The new suppLen field in spgConfigOut is not mentioned in the docs. Why
 not support pass-by-value supplementary values?

 A couple of quick comments on the code:

  @@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
  {
 spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
 spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
 -   int i;
 -   Point  *centroid;
 +   int i;
 +   Point *centroid;

  #ifdef USE_MEDIAN
 /* Use the median values of x and y as the centroid point */


 Please remove all unnecessary whitespace changes like this.

  @@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
 }

 Assert(in-nNodes == 4);
 +
 +   if (in-norderbys  0)
 +   {
 +   out-distances = palloc(in-nNodes * sizeof (double *));
 +   }


 I think that should be sizeof(double).

  +   *distances = (double *) malloc(norderbys * sizeof (double *));


 No mallocs allowed in the backend, should be palloc.


Hi Vlad,

I've been revisiting the status of the GSoC projects for this year and saw
this had stalled.  Do you have time to address the remaining issues so we
can get your work committed?

Thanks

Thom


Re: [HACKERS] KNN searches support for SP-GiST [GSOC'14]

2014-08-20 Thread Heikki Linnakangas

On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:

Hi there, pg-Hackers!
Here I go with the patch which brings up the possibility to perform
nearest-neighbour searches on SP-GiSTs (as of now includes implementation
for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
Sample benchmarking script included in the attachment (dumps the current
geonames archive and runs several searches on the (latitude, longitude)
points), which demonstrates the dramatic improvements against plain
searches and sorting.


Cool!

I think this needs a high-level description in access/spgist/README on 
how this works. You can probably copy-paste the similar description from 
gist's README, because it's the same design. If I understand correctly, 
the support functions return distances between the nodes and the query, 
and the SP-GiST code uses those distances to return the rows in order. 
An important detail there is that the distance returned for an inner 
node is a lower bound of the distance of any node in that branch.


A binary heap would be a better data structure than a red-black tree for 
the queue of nodes to visit/return. I know that GiST also uses a 
red-black for the same thing, but that's no reason not to do better 
here. There is a binary heap implementation in the backend too (see 
src/backend/lib/binaryheap.c), so it's not any more difficult to use.


Does the code handle ties correctly? The distance function is just an 
approximation, so it's possible that there are two items with same 
distance, but are not equal according to the ordering operator. As an 
extreme example, if you hack the distance function to always return 0.0, 
do you still get correct results (albeit slowly)?


The new suppLen field in spgConfigOut is not mentioned in the docs. Why 
not support pass-by-value supplementary values?


A couple of quick comments on the code:


@@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
 {
spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
-   int i;
-   Point  *centroid;
+   int i;
+   Point *centroid;

 #ifdef USE_MEDIAN
/* Use the median values of x and y as the centroid point */


Please remove all unnecessary whitespace changes like this.


@@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
}

Assert(in-nNodes == 4);
+   
+   if (in-norderbys  0)
+   {
+   out-distances = palloc(in-nNodes * sizeof (double *));
+   }


I think that should be sizeof(double).


+   *distances = (double *) malloc(norderbys * sizeof (double *));


No mallocs allowed in the backend, should be palloc.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] KNN searches support for SP-GiST [GSOC'14]

2014-08-19 Thread Vladislav Sterzhanov
Hi there, pg-Hackers!
Here I go with the patch which brings up the possibility to perform
nearest-neighbour searches on SP-GiSTs (as of now includes implementation
for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov.
Sample benchmarking script included in the attachment (dumps the current
geonames archive and runs several searches on the (latitude, longitude)
points), which demonstrates the dramatic improvements against plain
searches and sorting. Regression tests included, compiles and runs
successfully under both of my Ubuntu 12.04 Server and 08/2014 Arch Linux.

Vlad Sterzhanov / Quadrocube
diff --git a/doc/src/sgml/spgist.sgml b/doc/src/sgml/spgist.sgml
index 56827e5..5214770 100644
--- a/doc/src/sgml/spgist.sgml
+++ b/doc/src/sgml/spgist.sgml
@@ -83,6 +83,7 @@
literalgt;gt;/
literalgt;^/
literal~=/
+   literallt;-gt;/
   /entry
  /row
  row
@@ -95,6 +96,7 @@
literalgt;gt;/
literalgt;^/
literal~=/
+   literallt;-gt;/
   /entry
  /row
  row
@@ -137,6 +139,10 @@
   supports the same operators but uses a different index data structure which
   may offer better performance in some applications.
  /para
+ para
+  By supporting the ordering lt;-gt; operator the quad_point_ops and kd_point_ops provide 
+  a user with the ability to perform a K-nearest-neighbour search over the indexed point dataset.
+ /para
 
 /sect1
 
@@ -539,9 +545,12 @@ CREATE FUNCTION my_inner_consistent(internal, internal) RETURNS void ...
 typedef struct spgInnerConsistentIn
 {
 ScanKey scankeys;   /* array of operators and comparison values */
+ScanKey		orderbyKeys;	/* array of ordering operators and comparison values */
 int nkeys;  /* length of array */
+int norderbys;  /* length of array */
 
 Datum   reconstructedValue; /* value reconstructed at parent */
+Datum		suppValue		/* supplimentary value of parent */
 int level;  /* current level (counting from zero) */
 boolreturnData; /* original data must be returned? */
 
@@ -559,6 +568,8 @@ typedef struct spgInnerConsistentOut
 int*nodeNumbers;/* their indexes in the node array */
 int*levelAdds;  /* increment level by this much for each */
 Datum  *reconstructedValues;/* associated reconstructed values */
+Datum	   *suppValues;		/* any additional data implementation needs to be stored in the child nodes */
+double	   **distances;		/* associated distances */
 } spgInnerConsistentOut;
 /programlisting
 
@@ -573,10 +584,15 @@ typedef struct spgInnerConsistentOut
In particular it is not necessary to check structfieldsk_flags/ to
see if the comparison value is NULL, because the SP-GiST core code
will filter out such conditions.
+   structfieldorderbyKeys/, of length structfieldnorderbys/,
+   describes ordering operators (if any) in the same fashion.
structfieldreconstructedValue/ is the value reconstructed for the
parent tuple; it is literal(Datum) 0/ at the root level or if the
functioninner_consistent/ function did not provide a value at the
parent level.
+   structfieldsuppValue/ is any additional value that an implementation
+   decided to store for the parent node (literal(Datum) 0/ in case the 
+   current node is root).
structfieldlevel/ is the current inner tuple's level, starting at
zero for the root level.
structfieldreturnData/ is literaltrue/ if reconstructed data is
@@ -592,7 +608,6 @@ typedef struct spgInnerConsistentOut
structfieldnNodes/ is the number of child nodes contained in the
inner tuple, and
structfieldnodeLabels/ is an array of their label values, or
-   NULL if the nodes do not have labels.
   /para
 
   para
@@ -608,9 +623,17 @@ typedef struct spgInnerConsistentOut
structfieldreconstructedValues/ to an array of the values
reconstructed for each child node to be visited; otherwise, leave
structfieldreconstructedValues/ as NULL.
+   structfieldsuppValues/ serves the similiar purpose of holding
+   the implementation-defined data for the inner nodes.
+   structfielddistances/ if the ordered search is carried out,
+   the implementation is supposed to fill them in accordance to the
+   ordering operators provided in structfieldorderbyKeys/
+   (nodes with lowest distances will be processed first). Leave it
+   NULL otherwise.
Note that the functioninner_consistent/ function is
responsible for palloc'ing the
-   structfieldnodeNumbers/, structfieldlevelAdds/ and
+   structfieldnodeNumbers/, structfieldlevelAdds/,
+   structfielddistances/, structfieldsuppValues/ and
structfieldreconstructedValues/ arrays.
   /para
  /listitem
@@ -636,7 +659,9 @@ CREATE FUNCTION