Re: [PERFORM] index structure for 114-dimension vector
Let me just thank the list, especially for the references. (I found similar papers myself with Google: and to think I have a university library alumni card and barely need it any more!) I'll write again on the sorts of results I get.

Andrew Lazarus, Ph.D.
Director of R&D, ADR, San Francisco, CA
Re: [PERFORM] index structure for 114-dimension vector
On 5/1/07, Andrew Lazarus wrote:

> Let me just thank the list, especially for the references. (I found similar papers myself with Google: and to think I have a university library alumni card and barely need it any more!) I'll write again on the sorts of results I get.

Looking forward to hearing about them. I have worked with such dataset problems, but never attempted to apply them to a relational database such as PostgreSQL. If you want speed, nothing beats in-memory vectors on a modern CPU architecture.

Alexander.
Re: [PERFORM] index structure for 114-dimension vector
On 21-4-2007 1:42, Mark Kirkwood wrote:

> I don't think that will work for the vector norm, i.e.:
> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

I don't know if this is useful here, but I was able to rewrite that computation for a set of very sparse vectors (i.e. they had very few overlapping factors) to something like:

|x - y|^2 = sum over j (x[j]^2) + sum over j (y[j]^2)
          + for each j where x[j] and y[j] are both non-zero:
            - (x[j]^2 + y[j]^2) + (x[j] - y[j])^2

The first two sums can be calculated only once per vector. So if you have very little overlap, this is much more efficient (if there is no overlap at all, you end up with just the two sums of squares anyway). Besides, this rewritten calculation allows you to store the X and Y vectors using a trivial table layout vector(x,i,value) which is filled only with non-zeros and which you can trivially self-join to find the closest matches. With this rewrite you don't care about the j's where either the x- or the y-value is zero anyway.

I can compare over 1000 y's of on average 100 elements to two x's of over 1000 elements on just a single 1.8 GHz AMD processor. (I use it for a bi-kmeans algorithm, so there are only two buckets to compare to.)

So it might be possible to rewrite your algorithm to be less calculation-intensive. Obviously this isn't going to work with a dense matrix, but there may be other ways to circumvent parts of the algorithm or to cache large parts of it. It might also help to extract only the 6 relevant columns into a separate temporary table, which will have much smaller records and thus can fit more records per page.

Best regards,

Arjen
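A minimal sketch of that rewrite in plain Python (our own hypothetical illustration, not code from the post): vectors are dicts holding only their non-zero entries, and the per-vector sums of squares are computed once and reused across comparisons.

```python
def sparse_sq_distance(x, y, x_sq=None, y_sq=None):
    """Squared Euclidean distance between sparse vectors (dicts index -> value).

    x_sq and y_sq are the precomputed sums of squares; pass them in when
    comparing one vector against many to avoid recomputing them each time.
    """
    if x_sq is None:
        x_sq = sum(v * v for v in x.values())
    if y_sq is None:
        y_sq = sum(v * v for v in y.values())
    d = x_sq + y_sq
    # Correction terms apply only where both vectors are non-zero.
    for j in x.keys() & y.keys():
        d += -(x[j] ** 2 + y[j] ** 2) + (x[j] - y[j]) ** 2
    return d

# Overlap only at index 2; dense check: (1-0)^2 + (3-1)^2 + (0-2)^2 = 9
x = {0: 1.0, 2: 3.0}
y = {2: 1.0, 5: 2.0}
```

The table layout vector(x,i,value) mentioned in the post corresponds to storing exactly these (index, value) pairs, one row per non-zero element.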
Re: [PERFORM] index structure for 114-dimension vector
On Apr 20, 12:07 pm, Andrew Lazarus wrote:

> I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow. [...]

Having worked in high-dimensional spaces a lot in my career, I think you'll find that there are mathematical limits in terms of speed. In practical terms, a seq_scan will be unavoidable, since to a first approximation you are limited to doing an exhaustive search in 114-dimensional space unless you make approximations or dimensionality reductions of some kind.

Read up on the Curse of Dimensionality: http://en.wikipedia.org/wiki/Curse_of_dimensionality

Have you considered dimension-reduction techniques such as Singular Value Decomposition, Principal Components Analysis, etc.?
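One of the cheapest dimension-reduction techniques to try alongside SVD/PCA is a Gaussian random projection (Johnson-Lindenstrauss). A pure-Python sketch, with all function names being our own illustration:

```python
import math
import random

def random_projection(d_in, d_out, seed=0):
    """Return a d_out x d_in Gaussian projection matrix, scaled so that
    squared distances are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(d_in)]
            for _ in range(d_out)]

def project(matrix, vec):
    """Project a d_in-dimensional vector down to d_out dimensions."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Reduce 114-dimensional points to 12 dimensions before indexing.
P = random_projection(114, 12)
```

Nearest neighbours found in the reduced space are only approximate, so the usual pattern is to shortlist candidates there and re-rank the survivors with the full 114-dimensional distance.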
Re: [PERFORM] index structure for 114-dimension vector
On 4/20/07, Andrew Lazarus wrote:

> I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow.

Let me chime in with the observation that this is multidimensional nearest-neighbour search (along with its close cousins, reverse nearest neighbour and k-NN), which is well known in statistics and particularly relevant to statistical learning and classification. Knowing the jargon might help you dig up efficient algorithms to mine your data; there are tons of fascinating papers available through Citeseer.

In particular, I recommend the paper "Efficient k-NN Search on Vertically Decomposed Data" by de Vries et al., SIGMOD 2002 (PDF here: http://citeseer.ist.psu.edu/618138.html), if only for inspiration. It proposes an algorithm called BOND to drastically reduce the search space by probabilistic means. They give an example using image histograms, but the algorithm applies to any multidimensional data. Briefly put, it points out that proximity comparison can be computed vertically, a few dimensions at a time, and entire subsets can be thrown away when it's apparent that they are below a statistically derived lower bound. The only gotcha is that the algorithm derives much of its performance from the assumption that your data is vertically decomposed, one table per dimension; otherwise the search effectively incurs a sequential scan of the entire dataset, and then you're pretty much back to square one.

The most common approach to nearest-neighbour search is to use a spatial data structure. The classic algorithm is the kd-tree (http://en.wikipedia.org/wiki/Kd-tree), and there's the newer K-D-B tree, neither of which is available in PostgreSQL.
If I remember correctly, R-trees have also been shown to be useful for high numbers of dimensions; with PostgreSQL you have R-trees and even better R-tree-equivalent support through GiST. I have no idea whether you can actually munge your integer vectors into something GiST can index and search, but it's a thought. (GiST, presumably, can also theoretically index kd-trees and other spatial trees.)

Alexander.
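For reference, the kd-tree mentioned above is simple enough to sketch outside the database. A toy pure-Python single-nearest-neighbour version (our own illustration, not a PostgreSQL index and not from any library):

```python
def build_kdtree(points, depth=0):
    """Recursively split points on a cycling axis, median as the node."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(node, target, best=None):
    """Single nearest neighbour; descend the near side first, and visit the
    far side only if the splitting plane is closer than the current best --
    that pruning is where the speedup over a full scan comes from."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], target) < sq_dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best
```

In very high dimensions the pruning condition almost never fires, which is the curse-of-dimensionality point made elsewhere in this thread; kd-trees shine at 6 dimensions, not 114.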
Re: [PERFORM] index structure for 114-dimension vector
On Fri, 27 Apr 2007, Alexander Staubo wrote (full message quoted above; trimmed here):
> If I remember correctly, R-trees have also been shown to be useful for high numbers of dimensions; with PostgreSQL you have R-trees and even better R-tree-equivalent support through GiST. I have no idea whether you can actually munge your integer vectors into something GiST can index and search, but it's a thought. (GiST, presumably, can also theoretically index kd-trees and other spatial trees.)

You're right, but currently only theoretically, due to an interface restriction. We have a plan to improve it sometime. There was the SP-GiST project, which could be used for a K-D-B tree; see http://www.cs.purdue.edu/spgist/ -- I don't know if it works with the 8.2 version. Also, it doesn't support concurrency and recovery.

Regards,
Oleg

Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia
[PERFORM] index structure for 114-dimension vector
I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow.

I imported the cube contrib package, and I tried creating an index on a cube of the last 6 elements, which are the most important. Then I tested the 2.5MM rows for being contained within a tolerance of the last 6 elements of X, +/- 0.1 in each coordinate, figuring that would be an indexed search (which I CLUSTERED on). I then ran the sort on this smaller set. The index was used, but it was still too slow. I also tried creating new columns with rounded int2 values of the last 6 coordinates and made a multicolumn index.

For each X the search is taking about 4-15 seconds, which is above my target by at least one order of magnitude. Absolute numbers are dependent on my hardware and settings, and some of this can be addressed with configuration tweaks, etc., but first I think I need to know the optimum data structure/indexing strategy. Is anyone on the list experienced with this sort of issue?

Thanks.
Andrew Lazarus
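For what it's worth, the two-stage strategy described above (box prefilter on the last 6 coordinates, then rank survivors by full distance) looks like this in plain Python; a hypothetical stand-in for the SQL, with names of our own invention:

```python
import heapq
import math

def k_closest(rows, x, k=25, tol=0.1, key_dims=6):
    """Box-prefilter on the last key_dims coordinates, then return the k
    rows closest to x in the usual vector norm."""
    tail = x[-key_dims:]

    def in_box(row):
        return all(abs(a - b) <= tol for a, b in zip(row[-key_dims:], tail))

    def dist(row):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, x)))

    # heapq.nsmallest keeps only k candidates, avoiding a full sort.
    return heapq.nsmallest(k, (r for r in rows if in_box(r)), key=dist)
```

The prefilter is what the cube/int2 indexes were meant to accelerate; the final ranking step is always a scan over whatever the box lets through, so its cost scales with the 10-20K survivors reported below.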
Re: [PERFORM] index structure for 114-dimension vector
On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:

> I have a table with 2.5 million real[] arrays. (They are points in a time series.) Given a new array X, I'd like to find, say, the 25 closest to X in some sense--for simplification, let's just say in the usual vector norm. Speed is critical here, and everything I have tried has been too slow. I imported the cube contrib package, and I tried creating an index on a cube of the last 6 elements, which are the most important. Then I

Have you tried just making normal indexes on each of the last 6 elements and seeing if PostgreSQL will use a BitmapAnd to combine them?

Regards,
Jeff Davis
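Conceptually, a BitmapAnd intersects the sets of matching rows produced by each single-column index before touching the heap. A toy Python model of that step (our own sketch, not PostgreSQL internals):

```python
def bitmap_and(per_index_matches):
    """Intersect per-index sets of matching row ids.

    Process the smallest set first so intermediate results stay small,
    and bail out early once the intersection is empty.
    """
    sets = sorted(per_index_matches, key=len)
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
        if not result:
            break
    return result
```

This is why six cheap single-column indexes can beat one multicolumn btree for a box query: each index contributes an independent set, and only rows surviving all six range conditions are fetched.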
Re: [PERFORM] index structure for 114-dimension vector
Jeff Davis wrote:

> Have you tried just making normal indexes on each of the last 6 elements and see if postgresql will use a BitmapAnd to combine them?

I don't think that will work for the vector norm, i.e.:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Cheers,
Mark
Re: [PERFORM] index structure for 114-dimension vector
Because I know the 25 closest are going to be fairly close in each coordinate, I did try a multicolumn index on the last 6 columns and used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably inside that hypercube, on the distribution of data in question.) This hypercube tended to have 10-20K records, and took at least 4 seconds to retrieve. I was a little surprised by how long that took. So I'm wondering if my data representation is off the wall.

I should mention I also tried a cube index using GiST on all 114 elements, but CREATE INDEX hadn't finished in 36 hours when I killed it, and in retrospect I wasn't sure an index that took something like 6GB by itself would be helpful on a box with 2GB of RAM.

MK> I don't think that will work for the vector norm, i.e.:
MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

-- Sincerely,
Andrew Lazarus
Re: [PERFORM] index structure for 114-dimension vector
Andrew Lazarus wrote:

> Because I know the 25 closest are going to be fairly close in each coordinate, I did try a multicolumn index on the last 6 columns and used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably inside that hypercube on the distribution of data in question.) This hypercube tended to have 10-20K records, and took at least 4 seconds to retrieve.

Sorry -- in that case it probably *is* worth trying out 6 single-column indexes and seeing if they get bitmap-AND'ed together...

Mark
Re: [PERFORM] index structure for 114-dimension vector
Andrew Lazarus writes:

> Because I know the 25 closest are going to be fairly close in each coordinate, I did try a multicolumn index on the last 6 columns and used a +/- 0.1 or 0.2 tolerance on each. (The 25 best are very probably inside that hypercube on the distribution of data in question.) This hypercube tended to have 10-20K records, and took at least 4 seconds to retrieve. I was a little surprised by how long that took. So I'm wondering if my data representation is off the wall.

A multicolumn btree index isn't going to be helpful at all. Jeff's idea of using six single-column indexes with the above query might work, though.

regards, tom lane