Re: [PERFORM] index structure for 114-dimension vector

2007-05-01 Thread Andrew Lazarus
Let me just thank the list, especially for the references. (I found
similar papers myself with Google: and to think I have a university
library alumni card and barely need it any more!)

I'll write again on the sorts of results I get.


Re: [PERFORM] index structure for 114-dimension vector

2007-05-01 Thread Alexander Staubo

On 5/1/07, Andrew Lazarus [EMAIL PROTECTED] wrote:

Let me just thank the list, especially for the references. (I found
similar papers myself with Google: and to think I have a university
library alumni card and barely need it any more!)

I'll write again on the sorts of results I get.


Looking forward to hearing about them. I have worked on such dataset
problems before, but never attempted to tackle them inside a relational
database such as PostgreSQL. If you want speed, nothing beats in-memory
vectors on a modern CPU architecture.

Alexander.



Re: [PERFORM] index structure for 114-dimension vector

2007-04-27 Thread Arjen van der Meijden

On 21-4-2007 1:42 Mark Kirkwood wrote:

I don't think that will work for the vector norm, i.e.:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))


I don't know if this is useful here, but I was able to rewrite that
algorithm for a set of very sparse vectors (i.e. they had very few
overlapping factors) to something like:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
            + sum over j where x[j] and y[j] are both non-zero of
              ( (x[j] - y[j])^2 - x[j]^2 - y[j]^2 )

(That is the squared distance, but since sqrt is monotonic, ordering by
it yields the same nearest neighbours.)

The first two sums can be calculated once and reused. So if you have
very little overlap this is much more efficient (if there is no overlap
at all, you end up with just the two precalculated sums anyway).
Besides, this rewritten calculation allows you to store the X and Y
vectors using a trivial table layout vector(x, i, value) which is
filled only with the non-zero entries and which you can trivially
self-join to find the closest matches. With this rewrite you don't care
about the j's where either the x or the y value is zero.


I can compare over 1000 y's of on average 100 elements each to two x's
of over 1000 elements on just a single 1.8GHz AMD processor. (I use it
for a bi-kmeans algorithm, so there are only two buckets to compare to.)


So it might be possible to rewrite your algorithm to be less
calculation-intensive. Obviously, with a dense matrix this isn't going
to work, but there may be other ways to circumvent parts of the
algorithm or to cache large parts of it. It might also help to extract
only the 6 relevant columns into a separate temporary table, which will
have much smaller records and can thus fit more records per page.
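
As a rough sketch of what that layout and self-join could look like in
SQL (table and column names are mine, untested):

    -- One row per non-zero component of each vector.
    CREATE TABLE vec (id integer, j integer, val real,
                      PRIMARY KEY (id, j));

    -- Precalculate the per-vector sums once; they only change when a
    -- vector changes.
    CREATE TABLE vec_norm AS
      SELECT id, SUM(val * val) AS norm2 FROM vec GROUP BY id;

    -- Squared distance from vector 1 to every other vector; the
    -- correction term only touches dimensions where both vectors are
    -- non-zero.
    SELECT n2.id, n1.norm2 + n2.norm2 + COALESCE(o.corr, 0) AS dist2
    FROM vec_norm n1, vec_norm n2
    LEFT JOIN (
        SELECT b.id,
               SUM((a.val - b.val)^2 - a.val^2 - b.val^2) AS corr
        FROM vec a JOIN vec b ON b.j = a.j AND b.id <> a.id
        WHERE a.id = 1
        GROUP BY b.id
    ) o ON o.id = n2.id
    WHERE n1.id = 1 AND n2.id <> 1
    ORDER BY dist2
    LIMIT 25;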


Best regards,

Arjen



Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread C Storm
On Apr 20, 12:07 pm, [EMAIL PROTECTED] (Andrew Lazarus) wrote:
 I have a table with 2.5 million real[] arrays. (They are points in a
 time series.) Given a new array X, I'd like to find, say, the 25
 closest to X in some sense--for simplification, let's just say in the
 usual vector norm. Speed is critical here, and everything I have tried
 has been too slow.

 [...]

Having worked in high-dimensional spaces a lot in my career, I think
you'll find that there are mathematical limits in terms of speed. In
practical terms, a seq_scan will be unavoidable, since to a first
approximation you are limited to doing an exhaustive search in
114-dimensional space unless you make approximations or dimensionality
reductions of some kind.

Read up on the Curse of Dimensionality:
http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension-reduction techniques such as Singular
Value Decomposition, Principal Components Analysis, etc.?




Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Alexander Staubo

On 4/20/07, Andrew Lazarus [EMAIL PROTECTED] wrote:

I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.


Let me chime in with the observation that this is a multidimensional
nearest-neighbour problem (reverse nearest neighbour and its close
cousin, k-NN) that is well known in statistics, and particularly relevant to
statistical learning and classification. Knowing the jargon might help
you dig up efficient algorithms to mine your data; there are tons of
fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on
Vertically Decomposed Data" by de Vries et al., SIGMOD 2002 (PDF here:
http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It
proposes an algorithm called BOND that drastically reduces the search
space by probabilistic means. They give an example using image
histograms, but the algorithm applies to any multidimensional data.
Briefly put, it points out that proximity comparisons can be computed
vertically, a few dimensions at a time, and entire subsets can be
thrown away when it's apparent that they are below a statistically
derived lower bound. The only gotcha is that the algorithm derives
much of its performance from the assumption that your data is
vertically decomposed, one table per dimension; otherwise the search
effectively incurs a sequential scan of the entire dataset, and then
you're pretty much back to square one.
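
To make the vertical idea concrete, here is a toy sketch in SQL (the
schema and the constants are invented for illustration; the paper's
actual bound maintenance is more subtle):

    -- One narrow table per dimension, dim1 .. dim114.
    CREATE TABLE dim1 (id integer PRIMARY KEY, val real);
    CREATE TABLE dim2 (id integer PRIMARY KEY, val real);

    -- Partial squared distance over the first two dimensions for a
    -- query point (0.42, 1.7). Ids whose running total already exceeds
    -- the bound cannot be among the 25 nearest, so they are pruned
    -- before the remaining dimensions are ever touched.
    SELECT d1.id,
           (d1.val - 0.42)^2 + (d2.val - 1.7)^2 AS partial_dist2
    FROM dim1 d1
    JOIN dim2 d2 USING (id)
    WHERE (d1.val - 0.42)^2 + (d2.val - 1.7)^2 < 0.25;  -- current bound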

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which is available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)

Alexander.



Re: [PERFORM] index structure for 114-dimension vector

2007-04-26 Thread Oleg Bartunov

On Fri, 27 Apr 2007, Alexander Staubo wrote:


[...]

The most common approach to nearest neighbour search is to use a
spatial data structure. The classic algorithm is the kd-tree
(http://en.wikipedia.org/wiki/Kd-tree) and there's the newer K-D-B
tree, neither of which is available in PostgreSQL. If I remember
correctly, R-trees have also been shown to be useful for high numbers
of dimensions; with PostgreSQL you have R-trees and even better
R-tree-equivalent support through GiST. I have no idea whether you can
actually munge your integer vectors into something GiST can index and
search, but it's a thought. (GiST, presumably, can also theoretically
index kd-trees and other spatial trees.)


You're right, but currently only in theory, due to an interface
restriction. We have plans to improve it sometime. There was the SP-GiST
project, which could be used for a K-D-B tree; see
http://www.cs.purdue.edu/spgist/

I don't know if it works with the 8.2 version. Also, it doesn't support
concurrency and recovery.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



[PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Andrew Lazarus
I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.

I imported the cube contrib package, and I tried creating an index on
a cube of the last 6 elements, which are the most important. Then I
tested the 2.5MM rows for being contained within a tolerance of the
last 6 elements of X, +/- 0.1 in each coordinate, figuring that would
be an indexed search (which I CLUSTERED on). I then ran the sort on
this smaller set. The index was used, but it was still too slow. I
also tried creating new columns with rounded int2 values of the last 6
coordinates and made a multicolumn index.
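
In outline, the cube attempt looked something like this (simplified,
with invented table/column names; the literals stand in for the six
trailing elements of X):

    -- GiST index on a cube built from the last 6 array elements.
    CREATE INDEX pts_last6 ON pts USING gist (
        cube(ARRAY[v[109], v[110], v[111], v[112], v[113], v[114]])
    );

    -- Candidate set: rows within +/- 0.1 of X in those 6 coordinates;
    -- the 25 nearest are then sorted out of this smaller set.
    SELECT id
    FROM pts
    WHERE cube(ARRAY[v[109], v[110], v[111], v[112], v[113], v[114]])
          <@ cube_enlarge(cube(ARRAY[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]),
                          0.1, 6);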

For each X the search is taking about 4-15 seconds, which is above my
target by at least one order of magnitude. Absolute numbers are dependent
on my hardware and settings, and some of this can be addressed with
configuration tweaks, etc., but first I think I need to know the
optimum data structure/indexing strategy.

Is anyone on the list experienced with this sort of issue?

Thanks.
Andrew Lazarus  [EMAIL PROTECTED]




Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Jeff Davis
On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:
 I have a table with 2.5 million real[] arrays. (They are points in a
 time series.) Given a new array X, I'd like to find, say, the 25
 closest to X in some sense--for simplification, let's just say in the
 usual vector norm. Speed is critical here, and everything I have tried
 has been too slow.
 
 I imported the cube contrib package, and I tried creating an index on
 a cube of the last 6 elements, which are the most important. Then I

Have you tried just making normal indexes on each of the last 6 elements
to see if PostgreSQL will use a BitmapAnd to combine them?
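
A minimal sketch of what I mean (invented table/column names):

    -- One ordinary btree per trailing coordinate.
    CREATE INDEX pts_c109 ON pts (c109);
    CREATE INDEX pts_c110 ON pts (c110);
    CREATE INDEX pts_c111 ON pts (c111);
    -- ...and likewise for c112 .. c114.

    -- With luck, EXPLAIN shows a BitmapAnd of the individual index
    -- scans feeding one bitmap heap scan.
    SELECT id
    FROM pts
    WHERE c109 BETWEEN 0.4 AND 0.6
      AND c110 BETWEEN 1.1 AND 1.3
      AND c111 BETWEEN 0.0 AND 0.2;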

Regards,
Jeff Davis






Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Mark Kirkwood

Jeff Davis wrote:

On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:

I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.

I imported the cube contrib package, and I tried creating an index on
a cube of the last 6 elements, which are the most important. Then I


Have you tried just making normal indexes on each of the last 6 elements
to see if PostgreSQL will use a BitmapAnd to combine them?





I don't think that will work for the vector norm, i.e.:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))


Cheers

Mark



Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Andrew Lazarus
Because I know the 25 closest are going to be fairly close in each
coordinate, I did try a multicolumn index on the last 6 columns and
used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
inside that hypercube, given the distribution of the data in question.)

This hypercube tended to have 10-20K records, and took at least 4
seconds to retrieve. I was a little surprised by how long that took.
So I'm wondering if my data representation is off the wall.

I should mention I also tried a cube index using GiST on all 114
elements, but CREATE INDEX hadn't finished in 36 hours when I killed
it, and in retrospect I wasn't sure an index that took something like
6GB by itself would be helpful on a box with 2GB of RAM.

MK I don't think that will work for the vector norm, i.e.:

MK |x - y| = sqrt(sum over j ((x[j] - y[j])^2))


MK Cheers

MK Mark


-- 
Sincerely,
 Andrew Lazarus  mailto:[EMAIL PROTECTED]


Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Mark Kirkwood

Andrew Lazarus wrote:

Because I know the 25 closest are going to be fairly close in each
coordinate, I did try a multicolumn index on the last 6 columns and
used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
inside that hypercube, given the distribution of the data in question.)

This hypercube tended to have 10-20K records, and took at least 4
seconds to retrieve. I was a little surprised by how long that took.
So I'm wondering if my data representation is off the wall.

I should mention I also tried a cube index using GiST on all 114
elements, but CREATE INDEX hadn't finished in 36 hours when I killed
it, and in retrospect I wasn't sure an index that took something like
6GB by itself would be helpful on a box with 2GB of RAM.

MK I don't think that will work for the vector norm, i.e.:

MK |x - y| = sqrt(sum over j ((x[j] - y[j])^2))




Sorry, in that case it probably *is* worth trying out 6 single column 
indexes and seeing if they get bitmap and'ed together...


Mark



Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Tom Lane
Andrew Lazarus [EMAIL PROTECTED] writes:
 Because I know the 25 closest are going to be fairly close in each
 coordinate, I did try a multicolumn index on the last 6 columns and
 used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably
 inside that hypercube, given the distribution of the data in question.)

 This hypercube tended to have 10-20K records, and took at least 4
 seconds to retrieve. I was a little surprised by how long that took.
 So I'm wondering if my data representation is off the wall.

A multicolumn btree index isn't going to be helpful at all.  Jeff's idea
of using six single-column indexes with the above query might work,
though.

regards, tom lane
