How to deal with 96 Dimensional Points ?

2010-03-30 Thread Werner Van Belle
Hello,

I have been pondering this for a while, but never really looked deeply
into the problem.

I have 96-dimensional points and I would like to pose queries such as:
'give me all points that are within a given radius of this one'. The GIS
extensions to MySQL might support this type of query. The problem, of
course, is that GIS points are 2-dimensional, and I'm not sure whether I
can extend them beyond 2 or 3 dimensions.

Does anybody have an idea about this?

Wkr,

-- 
http://werner.yellowcouch.org/






Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Werner Van Belle
Geert-Jan Brits wrote:
> You're most likely talking about something like cosine similarity on
> N-dimensional vectors.
> http://en.wikipedia.org/wiki/Cosine_similarity
> http://stackoverflow.com/search?q=cosine+similarity

Cool links! Although that is not what I need it for. I'm really talking
about a Euclidean distance measure between vectors, so in a sense it is
simpler. Normally the GIS extensions already provide the basic tools
necessary, if only the points could be extended to more than 2 dimensions.

> You could google to see strategies that exist for mysql.
> However, depending on your use-case (e.g: scalable recommender
> systems) mysql (or any other rdbms) may not be the best tool for the job.



Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Werner Van Belle
Johan De Meersman wrote:
> Well... a point in an n-dimensional space is a location that has a
> defined value for each of its n dimensions. If you have a value for
> each of your 96 dimensions, you have a point.

Well, it's fairly simple. Suppose you have two points with 96 values
each: Point1=(x1,...,x96) and Point2=(y1,...,y96). The distance between
these two is

d = sqrt( (x1-y1)^2 + ... + (x96-y96)^2 )

There is no magic in this.
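
The formula above translates directly into code. A minimal sketch in Python (the thread itself uses PHP; the function name `euclidean_distance` and the sample points are illustrative assumptions, not something from this thread) that works for any number of dimensions:

```python
import math

def euclidean_distance(p, q):
    """Return sqrt((p1-q1)^2 + ... + (pn-qn)^2) for two equal-length points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical 96-dimensional points that differ by 1 in every dimension.
point1 = [0.0] * 96
point2 = [1.0] * 96
print(euclidean_distance(point1, point2))  # sqrt(96)
```

The same function handles 2, 3 or 96 dimensions; only the length of the lists changes.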

> The mathematics of comparing distances in 96 dimensions is beyond me,
> though :-) I guess a good start would be looking at comparing
> distances in 2 and 3 dimensions (vector math, that is) and trying to
> extrapolate a method from that. Alternatively, hire a mathematician :-p

Extrapolating from lower dimensions doesn't work too well. In this case
it would mean storing 48 different 2-dimensional points and then trying
to define a distance measure based on each individual point. I'm not
sure this is feasible.

In general, KD-trees are quite good tools for dealing with such
high-dimensional spaces, but I see no possibility of using them in MySQL.

Wkr,





Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Chris W
I'm not sure why, but it seems that some people (I don't mean to imply
that you are one of them) think there is some magic MySQL can perform to
find points within a given radius using the GIS extension. There is no
magic. It simply uses the well-known math required to determine which
points are inside the circle. I could be wrong, but I doubt there is any
way to create an index that can directly indicate points within a
certain distance of other points, unless that index included the distance
from every point to every other point. That is obviously not practical:
n points have n(n-1)/2 pairwise distances, so a set of only about
110,000 points would already need over 6 billion entries.


Let's call each of your dimensions d1, d2, d3, ..., d96.
If you create an index on d1, d2, ..., d96, you can then create a simple
query that will quickly find all points that are within a bounding box.
Since this query is going to get a bit large with 96 dimensions, I would
use code to create the query. I will use PHP. Let's start with the
desired radius being r and the test point dimensions being in an array
TestPointD[1] = x, TestPointD[2] = . . .



$select = 'SELECT `PointID`, ';
$where = 'WHERE ';
foreach($TestPointD as $i => $d){
 $di = 'd' . $i;
 $select .= "`$di`, ";
 $MinD = $d - $r;
 $MaxD = $d + $r;
 $where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
}
$select = substr($select, 0, -2);  //trim off the trailing comma and space
$where = substr($where, 0, -4);  //trim off the trailing 'AND '

$query = "$select FROM `points` $where";


Obviously this is going to give you points outside the sphere but still
inside the cube. However, it will narrow down the set so the further
math will not take as long. If this were 3 dimensions with a uniform
distribution of points, about 52% of the points returned by that query
would be inside the sphere. I'm not sure how to calculate the ratio of
the volume of a sphere to that of a cube in 96 dimensions. Then it will
be a simple loop to find the points you really want. While this query
will likely return a lot of points that you don't want, especially in
96D space, it will reduce the set enough that the following loop will be
much faster than looking at all points in the table.




$result = mysql_query($query) or die("DB error $query " . mysql_error());
while(($row = mysql_fetch_row($result))){
 $SumSq = 0;  // reset the sum of squared differences for each row
 foreach($row as $i => $d){
   if($i == 0){
     $PointID = $d;
     continue; // skip point id at $row[0]
   }
   $SumSq += pow($TestPointD[$i] - $d, 2);
 }
 if(sqrt($SumSq) <= $r){
   print "$PointID is within $r of test point.\n";
 }
}
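
The sphere-to-cube volume ratio left open above can be computed in closed form. A hedged sketch in Python (not the PHP used in this thread; the function name `ball_to_cube_ratio` is an illustrative assumption): an n-dimensional ball of radius r has volume pi^(n/2) * r^n / Gamma(n/2 + 1), and the enclosing cube has volume (2r)^n, so r cancels out of the ratio.

```python
import math

def ball_to_cube_ratio(n):
    """Fraction of the cube [-r, r]^n occupied by the inscribed n-ball.

    V_ball = pi^(n/2) * r^n / Gamma(n/2 + 1) and V_cube = (2r)^n, so r cancels.
    The ratio is computed through logarithms to stay numerically stable.
    """
    log_ratio = (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1) - n * math.log(2)
    return math.exp(log_ratio)

print(ball_to_cube_ratio(3))   # pi/6, about 0.5236: the 52% quoted above
print(ball_to_cube_ratio(96))  # vanishingly small
```

Under the same uniform-distribution assumption, the ratio in 96 dimensions is far below one in a billion, so nearly every point that passes the bounding-box filter would lie outside the sphere. Real data is rarely uniform, so this is only an indication of how weak the cube filter becomes as the dimension grows.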


In an application I had that was similar (but in 2D), I would insert the
id of the points that passed the condition into a temp table. Then I
could join that temp table to other tables to do other queries I might
need on those points.


Chris W



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Geert-Jan Brits
Perhaps you could give us a (generalized) description of your use case, so
we can better grasp what you want to achieve and how you want to use it.
I.e.: since I can't imagine/envision a real 'Euclidean distance' over 96
dimensions, I bet you're talking about a generalized distance function over
N dimensions.
This is usually only used in two general ways afaik:
1. calculating all points that lie within a certain threshold (Chris'
   implementation prints these points)
2. calculating an ordered top-M list of closest points to the target point
   (Chris' implementation slightly altered)
3. (hmm, maybe three: clustering points based on their distance to each
   other)
It helps if we know what you're after. For instance: if your points don't
change often and you want to achieve case 1 or 2, I would calculate these
once and all at once and save them in a separate table, because the
on-demand variant may quickly become too slow. Again, depending on your case:
option A: {pointid, {neighborids}} -- list of neighborids per pointid, with
pointid as key.
option B: {pointid, neighborid} -- one neighborid per pointid, with
pointid + neighborid as key.

Perhaps also helpful for googling etc.:
- a distance function is more often called a similarity function
- the top-n 'points' for a given point are usually called its neighbors
- in most cases you don't have to take the sqrt in Chris' implementation,
  which can save a lot; instead do: if($SumSq <= ($r*$r)){//code here}
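
The last tip works because both sides of the comparison are non-negative, so comparing squared values gives the same answer as comparing the distances themselves. A minimal sketch in Python (rather than the thread's PHP; `within_radius` is a hypothetical helper name):

```python
def within_radius(p, q, r):
    """True if q lies within distance r of p, without taking a square root.

    Valid for r >= 0: sum_sq <= r*r holds exactly when sqrt(sum_sq) <= r.
    """
    sum_sq = sum((a - b) ** 2 for a, b in zip(p, q))
    return sum_sq <= r * r

print(within_radius([0, 0, 0], [1, 2, 2], 3))    # True: the distance is exactly 3
print(within_radius([0, 0, 0], [1, 2, 2], 2.9))  # False
```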


Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Werner Van Belle
Hello Chris,

The use case I'm talking about is actually a typical use case for GIS
applications: give me the x closest points to this one. E.g.: give me
the 10 points closest to (1,2,79), or in my case: give me the 100 points
closest to (x1,...,x96). A query like yours might be possible, and might
be a good solution if we knew the radius in which we are looking for the
points, but this is not really the case: we merely want a list returned
ordered by distance. Solving this with your solution is possible but
quite slow. There exist nice data structures to deal with this type of
problem, as said, and these are used in the GIS implementation in MySQL.
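
For reference, the query described here ("give me the k closest points") can be stated as a brute-force baseline. The sketch below is in Python rather than the thread's PHP, keeps the points in a hypothetical in-memory dictionary instead of a MySQL table, and uses the illustrative name `k_nearest`. It scans every point, which is exactly the cost that specialized data structures try to avoid.

```python
import heapq

def k_nearest(points, target, k):
    """Return the k (squared_distance, point_id) pairs closest to target.

    `points` maps a point id to its coordinate list. Every point is scanned,
    so the cost grows linearly with the number of stored points.
    """
    def sq_dist(coords):
        return sum((a - b) ** 2 for a, b in zip(coords, target))
    return heapq.nsmallest(k, ((sq_dist(c), pid) for pid, c in points.items()))

# Hypothetical 3-dimensional sample data; the ids and coordinates are made up.
points = {1: [0, 0, 0], 2: [1, 2, 79], 3: [1, 2, 80], 4: [5, 5, 5]}
print(k_nearest(points, [1, 2, 79], 2))  # [(0, 2), (1, 3)]
```

No radius is needed as input: the result is ordered by distance and cut off after k entries.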

Chris W wrote:
> I'm not sure why, but it seems that some people (I don't mean to imply
> that you are one of them) think there is some magic MySQL can perform
> to find points within a given radius using the GIS extension. There
> is no magic. It simply uses the well-known math required to
> determine which points are inside the circle.

GIS extensions are also not only about distances: the above query is
better solved with specialized data structures.

> I could be wrong, but I doubt there is any way to create an index that
> can directly indicate points within a certain distance of other points,
> unless that index included the distance from every point to every
> other point. That is obviously not practical: n points have n(n-1)/2
> pairwise distances, so a set of only about 110,000 points would
> already need over 6 billion entries.

Partitioning of the space, such as is done in 3D render engines, solves
this problem more efficiently than having a list of all pairwise
distances. So the question is not whether such algorithms exist, but
rather whether they are available in/through MySQL.

> Let's call each of your dimensions d1, d2, d3, ..., d96. If you create
> an index on d1, d2, ..., d96, you can then create a simple query that
> will quickly find all points that are within a bounding box. Since
> this query is going to get a bit large with 96 dimensions, I would
> use code to create the query. I will use PHP.
> Let's start with the desired radius being r and the test point
> dimensions being in an array TestPointD[1] = x, TestPointD[2] = . . .
>
> $select = 'SELECT `PointID`, ';
> $where = 'WHERE ';
> foreach($TestPointD as $i => $d){
>  $di = 'd' . $i;
>  $select .= "`$di`, ";
>  $MinD = $d - $r;
>  $MaxD = $d + $r;
>  $where .= "`$di` >= '$MinD' AND `$di` <= '$MaxD' AND ";
> }
> $select = substr($select, 0, -2);  //trim off the trailing comma and space
> $where = substr($where, 0, -4);  //trim off the trailing 'AND '
>
> $query = "$select FROM `points` $where";

Thanks for the nice illustration. With the proper indices this will
indeed split the space into sections; nevertheless, this approach has
great difficulty returning a list ordered by distance, and preferably
only the 100 closest points at that.

Wkr,



Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Werner Van Belle
Geert-Jan Brits wrote:
> Perhaps you could give us a (generalized) description of your use case, so
> we can better grasp what you want to achieve and how you want to use it.
> I.e.: since I can't imagine/envision a real 'Euclidean distance' over 96
> dimensions, I bet you're talking about a generalized distance function
> over N dimensions.
> This is usually only used in two general ways afaik:
> 2. calculating an ordered top-M list of closest points to the
> target point (Chris' implementation slightly altered)

This is indeed the situation. A small alteration to Chris's
implementation won't do, since we do not know which radius to start with,
so it is not just a matter of adapting the post-filtering.

> 3. (hmm, maybe three:
> clustering points based on their distance to each other)

Yes, this is part of the use case, but at the moment not my main focus. A
statistical approach will need to be employed for that, without going for
full aggregation.

> It helps if we know what you're after. For instance: if your points don't
> change often and you want to achieve case 1 or 2, I would calculate these
> once and all at once and save them in a separate table, because the
> on-demand variant may quickly become too slow. Again, depending on your
> case:
> option A: {pointid, {neighborids}} -- list of neighborids per pointid,
> with pointid as key.
> option B: {pointid, neighborid} -- one neighborid per pointid, with
> pointid + neighborid as key.

That is not an option. Every 2 minutes or so the next point is randomly
chosen and we need a collection of points in its neighbourhood.

> Perhaps also helpful for googling etc.:
> - a distance function is more often called a similarity function
> - the top-n 'points' for a given point are usually called its neighbors
> - in most cases you don't have to take the sqrt in Chris' implementation,
>   which can save a lot; instead do: if($SumSq <= ($r*$r)){//code here}

Indeed, but this is only a fraction of the time. The larger problem lies
in searching all points that have potential. An idea that might work is
to shrink the radius we are looking at while we are searching, based on
the largest distance among the N closest neighbours found so far, and to
cut distance computations short if they will surely fall outside the
current N closest neighbours.
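
One way to read this idea is as a scan with a shrinking radius and early termination of each distance computation. The sketch below is an interpretation, not code from the thread: it is written in Python rather than PHP, works on a hypothetical in-memory dictionary of points, and the name `k_nearest_pruned` is made up for illustration.

```python
import heapq

def k_nearest_pruned(points, target, k):
    """k nearest neighbours found with a shrinking search radius.

    A max-heap (stored as negated squared distances) holds the best k
    candidates so far. Its largest entry is the current squared radius; the
    running sum for a new point is abandoned once it exceeds that radius.
    """
    if k <= 0:
        return []
    heap = []  # entries are (-squared_distance, point_id)
    for pid, coords in points.items():
        limit = -heap[0][0] if len(heap) == k else float("inf")
        sum_sq = 0.0
        for a, b in zip(coords, target):
            sum_sq += (a - b) ** 2
            if sum_sq > limit:
                break  # already farther than the current k-th neighbour
        else:
            # The loop ended without a break: the point is inside the radius.
            if len(heap) == k:
                heapq.heapreplace(heap, (-sum_sq, pid))
            else:
                heapq.heappush(heap, (-sum_sq, pid))
    return sorted((-neg, pid) for neg, pid in heap)

# Hypothetical sample data, as (squared distance, id) pairs in the result.
points = {1: [0, 0, 0], 2: [1, 2, 79], 3: [1, 2, 80], 4: [5, 5, 5]}
print(k_nearest_pruned(points, [1, 2, 79], 2))  # [(0.0, 2), (1.0, 3)]
```

This still visits every point, so it only trims the per-point arithmetic; it does not replace a spatial index.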



Re: How to deal with 96 Dimensional Points ?

2010-03-30 Thread Chris W
Here is an idea; I'm not going to code this one :) It's still not an
ideal solution because it has to make assumptions about your data set.
Execute the algorithm I outlined previously with a very small r value.
If you didn't find the number of points you are looking for, increase r
and modify the query slightly so it doesn't return any of the points the
first query returned, with something like AND `PointID` NOT IN ('34',
'56', '67', . . .).
At every step along the way, insert the point id of the points inside
r, along with their distance from the test point, into a temp table.
Once you have 100 or more records in this table, stop increasing r and
query the temp table sorted by distance with a limit of 100. Of course
you have to have some knowledge of your data set to get a reasonable
start value for r and a reasonable method for determining how much to
increase it each time.


On the other hand, a minor modification seems better. After inserting all
the points in the cube along with their distance into the temp table, a
query like SELECT count(*) FROM temp WHERE `Distance` <= r would be a
good way to see if you need to continue to the next round. Also, doing
it that way, instead of using the NOT IN syntax, which I understand can
be slow, you can modify the WHERE condition to find points that are
inside the current cube of size r but outside the previous cube.
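
The expanding-radius procedure described above can be sketched as follows. This is a simplified in-memory illustration in Python, not the PHP/MySQL version the message has in mind: the dictionary `points`, the function name `k_nearest_expanding` and the `growth` factor are all assumptions, and for brevity each round rescans the whole cube instead of only the shell between the previous and the current cube.

```python
import math

def k_nearest_expanding(points, target, k, r_start, growth=2.0):
    """Grow a bounding cube until k points lie inside the sphere of radius r."""
    if r_start <= 0 or growth <= 1:
        raise ValueError("r_start must be positive and growth greater than 1")
    r = r_start
    while True:
        # Bounding-box filter: the in-memory analogue of the WHERE clause.
        in_cube = {pid: c for pid, c in points.items()
                   if all(abs(a - b) <= r for a, b in zip(c, target))}
        dists = sorted((math.dist(c, target), pid) for pid, c in in_cube.items())
        in_sphere = [(d, pid) for d, pid in dists if d <= r]
        if len(in_sphere) >= k:
            return in_sphere[:k]  # k points within r: no closer point exists
        if len(in_cube) == len(points):
            return dists[:k]      # every point was examined already
        r *= growth

# Hypothetical sample data; the result holds (distance, id) pairs.
points = {1: [0, 0, 0], 2: [1, 2, 79], 3: [1, 2, 80], 4: [5, 5, 5]}
print(k_nearest_expanding(points, [1, 2, 79], 2, r_start=0.5))  # [(0.0, 2), (1.0, 3)]
```

Stopping is safe only once k points lie inside the sphere, not merely inside the cube, because a point in a corner of the cube can be farther away than a point just outside one of its faces.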


Chris W
