[HACKERS] GSOC Student Project Idea

2013-04-22 Thread Michael Schuh
Greetings,

Hello, my name is Michael Schuh and I am a PhD student in Computer Science
at Montana State University. I have never participated in GSOC before, but
I am very excited to propose a project to PostgreSQL that I feel would be a
great follow-up to last year's project by Alexander Korotkov (
http://www.google-melange.com/gsoc/project/google/gsoc2012/akorotkov/53002).
I contacted Mr. Korotkov's mentor from last year, Mr. Heikki Linnakangas,
and he suggested I email this mailing list with my idea.

In brief, I would like to implement a state-of-the-art indexing algorithm
(named iDistance) directly in PostgreSQL using GiST or SP-GiST trees and
whatever means necessary. It is an ideal follow-up to last year's project
with Mr. Korotkov, which implemented classical indexing structures for
range queries. I strongly believe the community would greatly benefit from
the inclusion of iDistance, which has been shown to be dramatically more
effective than R-trees and KD-trees, especially for knn queries and above
10-20 dimensions.

A major focus of my current PhD thesis is high-dimensional data indexing
and retrieval, with an emphasis towards applied use in CBIR systems.
Recently, I published work which introduced a new open source
implementation of iDistance in C++ (and some Python), which I believe makes
me highly qualified and motivated for this opportunity. I have been
strongly considering a PostgreSQL implementation for an easy plug-and-play
use in existing applications, but with academic grant funding, the priority
is low. Below are links to my google code repository and recent
publication. I am happy to discuss any of this in further detail if you'd
like.

https://code.google.com/p/idistance/
http://www.cs.montana.edu/~timothy.wylie/files/bncod13.pdf

Although I do not have a lot of experience with PostgreSQL development, I
am eager to learn and commit my summer to enabling another fantastic
feature for the community. Since iDistance is a non-recursive, data-driven,
space-based partitioning strategy which builds directly onto a B+-tree, I
believe the implementation should be possible using only GiST support.
Please let me know if this is of any interest, or if you have any
additional questions. Unfortunately, I will be unavailable most of the day,
but I plan to fill out the GSOC application later this evening.

Thank you for your time,
Mike Schuh


Re: [HACKERS] GSOC Student Project Idea

2013-04-24 Thread Michael Schuh
On Wed, Apr 24, 2013 at 5:31 AM, Florian Pflug f...@phlo.org wrote:

 On Apr23, 2013, at 23:25 , Alexander Korotkov aekorot...@gmail.com
 wrote:
  I've taken a brief look on the paper and implementation. As I can see
 iDistance implements some global building strategy. I mean, for example, it
 selects some point, calculates distances from selected point to all points
 in dataset etc. So, it uses the whole dataset at the same time.
 
  However you can try to implement global index building in GiST or
 SP-GiST. In this case I think you should carefully estimate your
 capabilities during single GSoC project. You would need to extend GiST or
 SP-GiST interface and write completely new implementation of tree building
 algorithm. Question of how to exactly extend GiST or SP-GiST interface for
 this case could appear to be very hard even theoretically.

 +1. That seemed to be a major roadblock to me too when I read the paper.

 You could work around that by making partition identification a separate
 step. You'd have a function

   idist_analyze(cfg name, table name, field name)

 which'd identify suitable partitions for the data distribution in
 table.field and store them somewhere. Such a set of pre-identified
 partitions would be akin to a tsearch configuration, i.e. all other parts
 of the iDistance machinery would use it to map points to index keys and
 queries to ranges of those keys. You'll want to look at how tsearch handles
 that, and check if the method can indeed be applied to iDistance.

 In a first cut, you'd probably only allow inserts into index which don't
 change the maximum distances from the partition centers that
 idist_analyze() found. With that restriction in place, you might get away
 without GiST or SP-GiSt, and simply use postgres' standard btree, if you
 find a way to map queries of the form field idistance-knn-operator
 idistance_knn_function(point, distance) to where (idstance_keys(field)
 between P1_lowerbound AND P2_upperbound) OR (idistance_keys(field) between
 P2_lowerbuild AND P2_upperbound) … or something equivalent.



Thank you both for the very helpful feedback. Perhaps the scope of this
project (application's completeness criteria) is better as
a feasibility prototyping of the global/distance-based index strategy with
B+-tree and/or GiST extension possibilities.

Let me briefly elaborate on the current implementation details related to
your comments. The public C++ version uses a stand-alone STX B+-tree
implementation, and has only been used in the research context of one-time
data loading and indexing and then KNN retrieval performance efficiency.
This requires an upfront global partition building (sequential look at all
data points) and a bit of overhead information about the partitions (such
as reference point locations and maximum distances in each). Of course,
there are a lot of improvements, modifications, variants, etc., that can be
built from this basic setup... but that's mostly my in-progress thesis work
yet published or defended :)

A given query is broken down into range(s) of key values in the B+-tree,
based on the negligible overhead info kept from partitioning. Only then
does this small subset of pages need to be loaded from disk. where the
partitions are located with respect to the query. Therefore it is necessary
to have left/right leaf pointers within the B+-tree. While I think a
DBMS-integrated implementation would be more ideal for general use by
everyone, my naive assumption is that I could probably implement the
idistance functionality in stored procedures and metadata tables (at the
cost of performance and usability).

Best regards,
Mike Schuh


Re: [HACKERS] GSOC Student Project Idea

2013-05-11 Thread Michael Schuh
On Wed, May 8, 2013 at 1:48 PM, Jim Nasby j...@nasby.net wrote:

 On 5/8/13 3:54 AM, Heikki Linnakangas wrote:

 On 24.04.2013 14:31, Florian Pflug wrote:

 On Apr23, 2013, at 23:25 , Alexander Korotkovaekorot...@gmail.com
 wrote:

 I've taken a brief look on the paper and implementation. As I can
 see iDistance implements some global building strategy. I mean, for
 example, it selects some point, calculates distances from selected
 point to all points in dataset etc. So, it uses the whole dataset
 at the same time.

 However you can try to implement global index building in GiST or
 SP-GiST. In this case I think you should carefully estimate your
 capabilities during single GSoC project. You would need to extend
 GiST or SP-GiST interface and write completely new implementation
 of tree building algorithm. Question of how to exactly extend GiST
 or SP-GiST interface for this case could appear to be very hard
 even theoretically.


 +1. That seemed to be a major roadblock to me too when I read the
 paper.

 You could work around that by making partition identification a
 separate step. You'd have a function

 idist_analyze(cfg name, table name, field name)

 which'd identify suitable partitions for the data distribution in
 table.field and store them somewhere. Such a set of pre-identified
 partitions would be akin to a tsearch configuration, i.e. all other
 parts of the iDistance machinery would use it to map points to index
 keys and queries to ranges of those keys. You'll want to look at how
 tsearch handles that, and check if the method can indeed be applied
 to iDistance.


 You could perform that step as part of the index build. Before the index
 build starts to add tuples to the index, it could scan a random sample of
 the heap and identify the partitions based on that.

 If you need to store the metadata, like a map of partitions, it becomes
 difficult to cajole this into a normal GiST or SP-GiST opclass. The API
 doesn't have any support for storing such metadata.

  In a first cut, you'd probably only allow inserts into index which
 don't change the maximum distances from the partition centers that
 idist_analyze() found.


 That seems like a pretty serious restriction. I'd try to write it so that
 you can insert any value, but if the new values are very different from any
 existing values, it would be OK for the index quality to degrade. For
 example, you could simply add any out-of-bounds values to a separate branch
 in the index, which would have no particular structure and would just have
 to be scanned on every query. You can probably do better than that, but
 that would be a trivial way to deal with it.


 Or you could use the new insert to start a new partition.

 Heck, maybe the focus should actually be on partitions and not individual
 records/points. ISTM the entire challenge here is figuring out a way to
 maintain a set of partitions that:

 - Are limited enough in number that you can quickly perform
 operations/searches across all partitions
 - Yet small enough that once you've narrowed down a set of partitions you
 don't have a ton of raw records to still look at

 Before we had range types I experimented with representing time ranges as
 rectangles of varying size (ie: for (start, end), create
 rectangle(point(start,start), point(end,end)). The problem with that is you
 had to convert timestamp into a float, which was not exact. So when
 querying you could use a GiST index on all the rectangles to narrow your
 scope, but you still needed a set of exact clauses (ie: start = now() - '1
 year' AND end = now()). Partitions would be similar in that they wouldn't
 be exact but could greatly narrow the search space (of course we'd want to
 handle the secondary exact checking internally instead of exposing the user
 to that).



I appreciate all the responses, and I think everyone has more-or-less
confirmed the scope of the project proposal I submitted. It was hard to
find time during the final weeks of the semester to greatly explore the
(SP-)GiST interfaces, but given the responses here, it seems the
integrated implementation is clearly beyond scope for a summer project,
which I agree with. Instead, I proposed my original plan that can surely be
accomplished over the summer. Coincidentally enough, it is in essence, what
Florian Pflug and the rest have discussed here.

In short, I will use only the btree in postgresql to store single
dimensional values mapped to multi-dimensional point data, and then query
ranges of these values in the btree based on partition information stored
separately. The information can be gathered upfront and periodically
updated as needed, which done properly will not require downtime or
reshuffling of the btree. This will result in a fully useable index, and
the project will also include a performance and usability assessment at the
end.

I still believe this would be an excellent GSoC project that I am highly
motivated to make fully accessible to the community