Re: [HACKERS] BK-Tree Implementation on top of GiST

2008-02-02 Thread Volkan Yazıcı
Florian Weimer <[EMAIL PROTECTED]> writes:
> http://citeseer.ist.psu.edu/1593.html suggests that this uninteresting
> (too much of the database is examined) once you go past an edit distance
> of 1.  I don't know if this is a problem in your case (it is in mine).

Did you see the test results in bk-tree[1] project? Results will
change with respect to metric distance distribution of your input
data, but I was quite impressed by the numbers when I first saw them.

[1] http://www.cliki.net/bk-tree


Regards.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] BK-Tree Implementation on top of GiST

2007-10-28 Thread Oleg Bartunov

Have you seen contrib/pg_trgm module ?

Oleg
On Sun, 28 Oct 2007, Volkan YAZICI wrote:


Hi,

In an address search framework of a company, we need to deal with
queries including potential spelling errors. After some external
address space constraints (e.g. match first letters, word length,
etc.) we're still ending up with a huge data set to filter through
Levenshtein like distance metrics.

Sequential scanning a record set with roughly 100,000 entries through
some sort of distance metric is not something we'd want in
production. For this purpose, I plan to implement BK-Trees[1] on top
of GiST, which will (technically) reduce overhead complexity from O(n)
to O(logn). As far as I'm concerned, such a work will worth the time
it will take when compared to overhead reduction it will bring.

[1] Some approaches to best-match file searching
   http://portal.acm.org/citation.cfm?id=362003.362025

Anyway, I have some experience with source code of intarray
module. Does anybody have suggestions/warnings/comments about such a
project? Would PostgreSQL team welcome such a patch to get integrated
into fuzzystrmatch module, or should I create my own project at
pgfoundry?

BTW, as you'd imagine, related implementation won't be something
specific to Levenshtein. Any distance metric on any kind of data will
be able to benefit from BK-Trees.


Regards.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] BK-Tree Implementation on top of GiST

2007-10-28 Thread Florian Weimer
* Volkan YAZICI:

> [1] Some approaches to best-match file searching
> http://portal.acm.org/citation.cfm?id=362003.362025

http://citeseer.ist.psu.edu/1593.html suggests that this uninteresting
(too much of the database is examined) once you go past an edit distance
of 1.  I don't know if this is a problem in your case (it is in mine).

It's a pity that this whole set of problems is still mostly unsolved.

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly