How to deal with 96 Dimensional Points ?
Hello,

I have been pondering this for a while, but never really looked deeply into the problem. I have 96-dimensional points and I would like to pose queries such as: 'give me all points that are within a given radius of this one'. The GIS extensions to MySQL might support this type of query. The problem is of course that their points are 2-dimensional, and I'm not sure whether I can extend them to more than 3 dimensions.

Does anybody have an idea about this?

Wkr,
--
http://werner.yellowcouch.org/
Re: How to deal with 96 Dimensional Points ?
Geert-Jan Brits wrote:
> You're most likely talking about something like cosine similarity on N-dimensional vectors.
> http://en.wikipedia.org/wiki/Cosine_similarity
> http://stackoverflow.com/search?q=cosine+similarity

Cool links! Although that is not what I need it for. I'm really talking about a Euclidean distance measure between vectors, so in a sense it is simpler. Normally the GIS extensions already provide the basic tools necessary, if only the points could be extended to more than 2 dimensions.

> You could google to see strategies that exist for mysql. However, depending on your use case (e.g. scalable recommender systems), mysql (or any other rdbms) may not be the best tool for the job.

--
http://werner.yellowcouch.org/
Re: How to deal with 96 Dimensional Points ?
Johan De Meersman wrote:
> Well... a point in an n-dimensional space is a location that has a defined value for each of its n dimensions. If you have a value for each of your 96 dimensions, you have a point.

Well, it's fairly simple. If you have two points with 96 values each, Point1 = (x_1, ..., x_96) and Point2 = (y_1, ..., y_96), the distance between the two is

    d = sqrt( (x_1 - y_1)^2 + ... + (x_96 - y_96)^2 )

There is no magic in this.

> The mathematics of comparing distances in 96 dimensions is beyond me, though :-) I guess a good start would be looking at comparing distances in 2 and 3 dimensions (vector math, that is) and trying to extrapolate a method from that. Alternatively, hire a mathematician :-p

Extrapolating from lower dimensions doesn't work too well. In this case it would mean storing 48 different 2-dimensional points and then trying to define a distance measure based on each individual point. I'm not sure this is feasible.

In general, KD-trees are a standard tool for this kind of spatial query (although their advantage shrinks as the number of dimensions grows), but I see no possibility of using them in MySQL.

Wkr,

> On Tue, Mar 30, 2010 at 11:39 AM, Werner Van Belle <wer...@yellowcouch.org> wrote:
>> I have 96 dimensional points and I would like to pose queries such as: 'give me all points that are within such a radius of this one'. [...]

--
http://werner.yellowcouch.org/
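The distance formula in the message above can be sketched in a few lines. This is only an illustration in Python (not code from the thread); the function name `euclidean_distance` and the two random sample points are assumptions made for the example.

```python
import math
import random


def euclidean_distance(p, q):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # d = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Two hypothetical 96-dimensional points with coordinates in [0, 1).
random.seed(0)
point1 = [random.random() for _ in range(96)]
point2 = [random.random() for _ in range(96)]
print(euclidean_distance(point1, point2))
```

The number of dimensions changes nothing about the formula itself; the hard part discussed in the rest of the thread is avoiding this computation for every stored point.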
Re: How to deal with 96 Dimensional Points ?
I'm not sure why, but it seems that some people (I don't mean to imply that you are one of them) think there is some magic MySQL can perform to find points within a given radius using the GIS extension. There is no magic. They simply use the well-known math required to determine which points are inside the circle.

I could be wrong, but I doubt there is any way to create an index that can directly indicate points within a certain distance of other points unless that index included the distance from every point to every other point. That is obviously not practical: n points have n(n-1)/2 pairwise distances, so a set of only about 110,000 points would already need over 6 billion index entries.

Let's call each of your dimensions d1, d2, d3, ..., d96. If you create indexes on d1, d2, ..., d96, you can then create a simple query that will quickly find all points that are within a bounding box. Since this query is going to get a bit large with 96 dimensions, I would use code to create the query. I will use PHP. Let's start with the desired radius being $r and the test point dimensions being in an array: $TestPointD[1] = x, $TestPointD[2] = . . .

    $select = 'SELECT `PointID`, ';
    $where = 'WHERE ';
    foreach ($TestPointD as $i => $d) {
        $di = 'd' . $i;                 // column name for dimension $i
        $select .= "`$di`, ";
        $MinD = $d - $r;                // lower edge of the bounding box
        $MaxD = $d + $r;                // upper edge of the bounding box
        $where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
    }
    $select = substr($select, 0, -2);   // trim off the trailing comma and space
    $where = substr($where, 0, -4);     // trim off the trailing 'AND '
    $query = "$select FROM `points` $where";

Obviously this is going to give you points outside the sphere but still inside the cube. However, it will narrow down the set so the further math will not take as long. If this were 3 dimensions with a uniform distribution of points, about 52% of the points returned by that query would be inside the sphere. I'm not sure how to calculate the ratio of the volume of a sphere to that of a cube in 96 dimensions. Then it will be a simple loop to find the points you really want.

While this query will likely return a lot of points that you don't want, especially in 96D space, it will reduce the set enough that the following loop will be much faster than looking at all points in the table.

    // Note: the mysql_* functions were removed in PHP 7; mysqli or PDO offer equivalents.
    $result = mysql_query($query) or die("DB error $query " . mysql_error());
    while (($row = mysql_fetch_row($result))) {
        $SumSq = 0;                     // reset the sum of squares for each row
        foreach ($row as $i => $d) {
            if ($i == 0) {
                $PointID = $d;
                continue;               // skip point id at $row[0]
            }
            $SumSq += pow($TestPointD[$i] - $d, 2);
        }
        if (sqrt($SumSq) <= $r) {
            print "$PointID is within $r of test point.\n";
        }
    }

In a similar application I had (but in 2D), I would insert the ids of the points that passed the condition into a temp table. Then I could join that temp table to other tables to do any other queries I needed on those points.

Chris W

Werner Van Belle wrote:
> I have 96 dimensional points and I would like to pose queries such as: 'give me all points that are within such a radius of this one'. [...]

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=arch...@jab.org
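The open question in the message above, the ratio of the volume of a sphere to that of its bounding cube in 96 dimensions, has a closed form. A minimal sketch in Python, assuming the standard n-ball volume V_n(r) = pi^(n/2) r^n / Gamma(n/2 + 1) and a cube of side 2r; the function name `ball_to_cube_ratio` is made up for this example.

```python
import math


def ball_to_cube_ratio(n):
    """Volume of an n-ball of radius r divided by the volume of the cube of side 2r.

    ratio = pi^(n/2) / (2^n * Gamma(n/2 + 1)); the radius cancels out.
    Evaluated in log space so that large n does not overflow.
    """
    log_ratio = (n / 2) * math.log(math.pi) - n * math.log(2) - math.lgamma(n / 2 + 1)
    return math.exp(log_ratio)


for n in (2, 3, 10, 96):
    print(n, ball_to_cube_ratio(n))
```

For n = 3 this gives pi/6, which matches the "about 52%" figure in the message. For n = 96 the ratio is vanishingly small (far below 10^-60), so with uniformly distributed data essentially every row returned by the bounding-box query lies outside the sphere. Real data is rarely uniform, so this is a worst-case intuition rather than a prediction.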
Re: How to deal with 96 Dimensional Points ?
Perhaps you could give us a (generalized) description of your use case, so we can better grasp what you want to achieve and how you want to use it. That is, since I can't envision a real 'Euclidean distance' over 96 dimensions, I bet you're talking about a generalized distance function over N dimensions. As far as I know this is usually used in only a few general ways:

1. calculating all points that lie within a certain threshold (Chris' implementation prints these points)
2. calculating an ordered top-M list of the closest points to the target point (Chris' implementation, slightly altered)
3. (hmm, maybe a third: clustering points based on their distance to each other)

It helps if we know what you're after. For instance, if your points don't change often and you want to achieve case 1 or 2, I would calculate these once, all at once, and save them in a separate table, because the on-demand variant may quickly become too slow. Again depending on your case:

option A. {pointid, {neighborids}} -- list of neighborids per pointid, with pointid as key.
option B. {pointid, neighborid} -- one neighborid per row, with pointid + neighborid as key.

Perhaps also helpful for Google etc.:
- a distance function is closely related to what is often called a similarity function (the smaller the distance, the higher the similarity)
- the top-N 'points' for a given point are usually called its neighbors
- in most cases you don't have to take the sqrt in Chris' implementation, which can save a lot; instead do: if ($SumSq <= ($r * $r)) { /* code here */ }

> 2010/3/30 Chris W <4rfv...@cox.net>:
>> I'm not sure why, but it seems that some people, I don't mean to imply that you are one of them, think there is some magic MySQL can perform to find points within a given radius using the GIS extension. There is no magic. [...]
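The precomputation suggested above (option B, one row per point/neighbor pair) could look roughly like the following. This is a brute-force Python sketch under stated assumptions: the points fit in memory as a dict of `pointid -> coordinates`, and the helper names `squared_distance` and `neighbor_rows` are invented for the illustration. It compares squared distances, so no square root is taken.

```python
import heapq


def squared_distance(p, q):
    """Squared Euclidean distance; enough for ranking, no sqrt needed."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def neighbor_rows(points, m):
    """Return (pointid, neighborid) rows for the m nearest neighbors of every point.

    points maps pointid -> coordinate tuple. This is O(n^2) distance
    computations, so it only suits data sets that change rarely.
    """
    rows = []
    for pid, p in points.items():
        candidates = (
            (squared_distance(p, q), qid) for qid, q in points.items() if qid != pid
        )
        for _, qid in heapq.nsmallest(m, candidates):
            rows.append((pid, qid))
    return rows


# Tiny 2-dimensional stand-in for the real 96-dimensional table.
points = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 5.0), 4: (0.0, 2.0)}
rows = neighbor_rows(points, 2)
print(rows)
```

Each resulting row would be inserted into the separate neighbor table, keyed on pointid + neighborid.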
Re: How to deal with 96 Dimensional Points ?
Hello Chris,

The use case I'm talking about is actually a typical use case for GIS applications: give me the x closest points to this one. E.g. give me the 10 points closest to (1,2,79), or in my case: give me the 100 points closest to (x_1, ..., x_96). A query like yours might be possible, and might be a good solution if we knew the radius within which to look for the points, but this is not really the case: we merely want a list returned ordered by distance. Solving this with your solution is possible but quite slow. As said, nice data structures exist to deal with this type of problem, and these are used in the GIS implementation in MySQL.

Chris W wrote:
> I'm not sure why, but it seems that some people, I don't mean to imply that you are one of them, think there is some magic MySQL can perform to find points within a given radius using the GIS extension. There is no magic. They simply use the well known math required to determine what points are inside the circle.

GIS extensions are also not only about distances: the above query is better solved with specialized data structures.

> I could be wrong but I doubt there is any way to create an index that can directly indicate points within a certain distance of other points unless that index included the distance from every point to every other point. That is obviously not practical [...]

Partitioning the space, as is done in 3D render engines, solves this problem more efficiently than having a list of all pairwise distances. So the question is not whether such algorithms exist, but rather whether they are available in or through MySQL.

> Let's call each of your dimensions d1, d2, d3, ..., d96. If you create an index on d1, d2, ..., d96, you can then create a simple query that will quickly find all points that are within a bounding box. Since this query is going to get a bit large with 96 dimensions, I would use code to create the query. [...]

Thanks for the nice illustration. With the proper indices this will indeed split the space into sections; nevertheless this approach has great difficulty returning a list ordered by distance, and preferably only the 100 closest points at that.

Wkr,
--
http://werner.yellowcouch.org/
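For reference, the "ordered by distance, only the 100 closest" requirement can at least be expressed directly in SQL, at the cost of a full table scan. The sketch below (Python, generating the statement) assumes the `points` table with columns `PointID` and `d1` ... `d96` from Chris' example; the function name `knn_query` and the `DistSq` alias are made up here, and the `%s` placeholders assume a database driver that uses that parameter style.

```python
def knn_query(num_dims, k):
    """Build a brute-force k-nearest-neighbour query ordered by squared distance.

    Coordinates of the test point are left as %s placeholders, to be bound
    by the database driver rather than pasted into the SQL string.
    """
    terms = " + ".join(f"POW(`d{i}` - %s, 2)" for i in range(1, num_dims + 1))
    return (
        f"SELECT `PointID`, {terms} AS `DistSq` "
        f"FROM `points` ORDER BY `DistSq` LIMIT {int(k)}"
    )


# Small example so the generated statement stays readable.
print(knn_query(3, 10))
```

With 96 dimensions the test point's coordinates would be passed as the 96 bound parameters. No index helps this query, since the distance is computed for every row, so it addresses the ordering problem but not the speed problem raised in the message.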
Re: How to deal with 96 Dimensional Points ?
Geert-Jan Brits wrote:
> Perhaps you could give us a (generalized) description of your use case, so we can better grasp what you want to achieve, and how you want to use it. i.e. since I can't envision a real 'Euclidean distance' over 96 dimensions I bet you're talking about a generalized distance function over N dimensions. This is usually only used in two general ways afaik:
> 2 calculating an ordered top-M list of the closest points to the target point (Chris' implementation slightly altered)

This is indeed the situation. A small alteration to Chris' implementation won't do, since we do not know which radius to start with, so it is not just a matter of adapting the post-filtering.

> 3 (hmm maybe three: clustering points based on their distance to each other)

Yes, this is part of the use case, but at the moment not my main focus. A statistical approach will need to be employed for that, without going for full aggregation.

> It helps if we know what you're after. For instance, if your points don't change often and you want to achieve case 1 or 2, I would calculate these once and all at once and save them in a separate table, because the on-demand variant may quickly become too slow. Again depending on your case:
> option A. {pointid, {neighborids}} -- list of neighborids per pointid, with pointid as key.
> option B. {pointid, neighborid} -- one neighborid per row, with pointid + neighborid as key.

That is not an option. Every 2 minutes or so the next point is chosen at random, and we need a collection of points in its neighbourhood.

> perhaps also helpful for google etc.:
> - a distance function is more often called a similarity function
> - top-n 'points' for a given point are usually called its neighbors.
> - in most cases you don't have to take the sqrt in Chris' implementation, which can save a lot (but instead do: if ($SumSq <= ($r * $r)) { //code here }

Indeed, but this is only a fraction of the time. The larger problem lies in searching all points that have potential.
An idea that might work is to shrink the radius we are looking at while we search, based on the largest distance among the N closest neighbours found so far, and to cut distance computations short as soon as they are certain to fall outside the current N closest neighbours.

--
http://werner.yellowcouch.org/
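The idea in the last paragraph can be sketched as a single pass that keeps the best N candidates in a bounded heap and abandons a distance computation as soon as the partial sum exceeds the current worst candidate. This is an illustrative Python sketch, not code from the thread; the name `k_nearest` and the `(pointid, coords)` input format are assumptions.

```python
import heapq


def k_nearest(points, target, k):
    """Return the k nearest points as (squared_distance, pointid), nearest first.

    points is an iterable of (pointid, coords). The search radius shrinks as
    better candidates are found, and each distance sum stops early once it
    can no longer beat the current k-th best candidate.
    """
    if k <= 0:
        return []
    heap = []  # max-heap via negated squared distances; holds the best k so far
    for pid, coords in points:
        # Squared radius of the current search: distance of the worst kept candidate.
        limit = -heap[0][0] if len(heap) == k else float("inf")
        dist_sq = 0.0
        for x, y in zip(coords, target):
            dist_sq += (x - y) ** 2
            if dist_sq >= limit:
                break  # already outside the current k closest, stop summing
        else:
            if len(heap) == k:
                heapq.heapreplace(heap, (-dist_sq, pid))
            else:
                heapq.heappush(heap, (-dist_sq, pid))
    return sorted((-neg, pid) for neg, pid in heap)


sample = [(1, (0.0, 0.0)), (2, (1.0, 0.0)), (3, (5.0, 5.0)), (4, (0.0, 2.0))]
print(k_nearest(sample, (0.0, 0.0), 2))
```

This still visits every point, so it saves arithmetic rather than I/O; avoiding the full scan would need a spatial index or a partitioning of the space.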
Re: How to deal with 96 Dimensional Points ?
Here is an idea; I'm not going to code this one :) It's still not an ideal solution because it has to make assumptions about your data set.

Execute the algorithm I outlined previously with a very small r value. If you didn't find the number of points you are looking for, increase r and modify the query slightly so it doesn't return any of the points the first query returned, with something like AND `PointID` NOT IN ('34', '56', '67', . . .). At every step along the way, insert the point ids of the points inside r into a temp table, along with their distance from the test point. Once you have over 100 records in this table, stop increasing r and query the temp table sorted by distance with a limit of 100. Of course you have to have some knowledge of your data set to get a reasonable start value for r and a reasonable method for determining how much to increase it each time.

On the other hand, a minor modification seems better. By inserting all the points in the cube along with their distance into the temp table, a query like SELECT count(*) FROM temp WHERE `Distance` <= r would be a good way to see if you need to continue to the next round. Also, doing it that way, instead of using the NOT IN syntax, which I understand can be slow, you can modify the WHERE condition to find points that are inside the current cube of size r but outside the previous cube.

Chris W

Werner Van Belle wrote:
> The use case I'm talking about is actually a typical use case for GIS applications: give me the x closest points to this one. [...]
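The expanding-radius procedure described in this message can be sketched in memory as follows. This is a Python illustration under stated assumptions, not Chris' code: the list `found` stands in for the temp table, the scan over `points` stands in for the bounding-box query, and the names `expanding_radius_search`, `r_start` and `growth` are invented for the example. Each round only admits points inside the current cube but outside the previous one, as the "minor modification" suggests.

```python
import math


def expanding_radius_search(points, target, k, r_start, growth=2.0):
    """Return the k nearest points as (distance, pointid), nearest first.

    points is a list of (pointid, coords); r_start must be > 0 and growth > 1.
    """
    k = min(k, len(points))
    if k <= 0:
        return []
    found = []                 # stand-in for the temp table: (distance, pointid)
    r_prev, r = -1.0, r_start  # r_prev < 0 so a point equal to the target is admitted
    while True:
        for pid, coords in points:
            # Largest per-dimension offset decides which cube shell a point is in.
            offset = max(abs(x - y) for x, y in zip(coords, target))
            if r_prev < offset <= r:
                found.append((math.dist(coords, target), pid))
        # Only points within the sphere of radius r are guaranteed to be final.
        inside = sorted(item for item in found if item[0] <= r)
        if len(inside) >= k:
            return inside[:k]
        r_prev, r = r, r * growth


sample = [(1, (0.0, 0.0)), (2, (1.0, 0.0)), (3, (5.0, 5.0)), (4, (0.0, 2.0))]
print(expanding_radius_search(sample, (0.0, 0.0), 2, r_start=0.5))
```

The stop condition counts only points whose true distance is at most r, because a point still outside the cube can never be closer than r. As the message notes, a sensible `r_start` and growth factor depend on knowing the data set.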