Re: [Numpy-discussion] finding close together points.

2009-11-16 Thread Anne Archibald
2009/11/16 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/13 Christopher Barker chris.bar...@noaa.gov:
 Wow! great -- you sounded interested, but I had no idea you'd run out
 and do it! thanks! we'll check it out.

 well, it turns out the Python version is unacceptably slow. However, we
 found we could use ckdtree, and use:

 tree.query(points,
             2, # Number of points to return per point given.
             distance_upper_bound = distance_tolerance,
            )

 This gives us a set of pairs of points closer than our distance
 tolerance -- it includes duplicates, of course, but even when we filter
 those in pure python, it's is nice and speedy.

 It looks like it would be pretty easy to add this to the Cython code,
 but it's working now for us, so...

You probably noticed this, but I should remind you that if a point has
more than one nearby neighbor, this may not give you what you wanted.
In particular, if there are three points all within the fixed distance
from each other, you may not find all three pairs this way.

Anne

 Thanks again,

 -Chris



 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-13 Thread Christopher Barker
Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is now implemented in SVN.

Wow! great -- you sounded interested, but I had no idea you'd run out 
and do it! thanks! we'll check it out.

 I (tentatively?) used a set to store
 the collection of pairs, because my tree traversal is not smart enough
 to reliably uniquify the pairs without using sets. With more time and
 energy, I'm sure the algorithm could be improved to avoid using sets
 (both internally and on return), but I think that's something to save
 for the Cython version.

I agree -- what's wrong with using a set?

Thanks, we'll let you know how it works for us.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-13 Thread Anne Archibald
2009/11/13 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is now implemented in SVN.

 Wow! great -- you sounded interested, but I had no idea you'd run out
 and do it! thanks! we'll check it out.

No problem, it was a fairly minor modification of the two-tree code.

 I (tentatively?) used a set to store
 the collection of pairs, because my tree traversal is not smart enough
 to reliably uniquify the pairs without using sets. With more time and
 energy, I'm sure the algorithm could be improved to avoid using sets
 (both internally and on return), but I think that's something to save
 for the Cython version.

 I agree -- what's wrong with using a set?

It should be possible to arrange the tree traversal algorithm so that
each pair of subtrees is automatically traversed only once - akin to
for i in range(n):
for j in range(i):
This would avoid having to build and modify a set every time you
traverse the tree (sets are fairly cheap hash tables, so the cost is
probably minor compared to other python overhead), and it would also
naturally allow the result to be a list/inflatable array (which would
save memory, if nothing else). But it would also require carefully
thinking through the tree traversal algorithm.

 Thanks, we'll let you know how it works for us.

Good luck!

Anne

 -Chris


 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
Peter Schmidtke wrote:
 On Tue, 10 Nov 2009 16:07:32 -0800, Christopher Barker
 chris.bar...@noaa.gov wrote:

 I have a bunch of points in 2-d space, and I need to find out which 
 pairs of points are within a certain distance of one-another (regular 
 old Euclidean norm).
 
 How big is your set of points?

Could be 100s of thousands, maybe ever millions. That's why O(N^2) is 
not good.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
David Cournapeau wrote:

 I would love having a core C library of containers - 

I'm all for that. However, I think that a very, very common need is 
simply for a growable numpy array.

It seems this would actually be pretty darn easy (again, for someone 
familiar with the code!).

IIUC, it would be exactly like a regular old ndarray, except:

  1) The data block could be larger than the current size of the array:
 - this would require an additional attribute to carry that size

  2) There would be an append method, or some other name:
 - this would increase the size of the array, manipulating 
.dimensions, and, if needed, doing a reallocation of the data pointer.
 - This is not very complex code, really, you simply allocate more 
than required, choosing some (user-resettable?) scale: 25% more than 
before, or whatever. re-allocating memory is actually pretty cheap these 
days, so it's not all that critical how you do it.

  3) we'd probably want a .fit() method that would resize the data block 
to fit the current size of the array.

In fact, as I write this, I wonder if these features could simply be 
added to the regular old ndarray -- if not used, they wouldn't be much 
overhead.

Issues:

  - views on the same data block: as the data pointer would need to be 
re-allocated, you couldn't have views on the data block. Maybe this 
could be handles in the same way that ndarray.resize is handles now: 
you'd get an error if you try to resize an array that has a view to its 
data block. This is a bit tricky, as it's really easy to create a view 
without thinking about it (like by using ipython, for instance)

  - You could only grow the array in the first dimension (C-order, 
anyway). I think it could still be very useful, handling many of the 
common cases.

  - if you append to an array of more than one dimension, do you need to 
give the entire appended slice? Again, restrictive, but still useful.

  - Others? -- I'm not thinking of any right now.


David Cournapeau wrote:
 If you need to manipulate
 core datatypes (float, etc...), I think python lists are a significant
 cost, because of pointer chasing in particular.

Exactly, and add custom numpy dtypes as well. Putting python types 
and/or tuples of python types into a list can be far more memory 
intensive than putting them in a numpy array.

That's why I built a python implementation of a growable array -- it 
does save memory, but it's a fair bit slower than lists -- when adding 
to it in python, you've got python types already, so the additional 
overhead of an extra function call, etc. is substantial. That's why I'd 
like a C version.

Even more important is having one that is not only written in C (or 
Cython) but can be accessed directly from C (or Cython), so you can fill 
an array with core data types efficiently. Going from C-type to python 
object to put it in a list is substantial overhead.

 Stefan used some tricks to avoid using python list and use basic C
 structures in some custom code. 

Exactly -- and he's not he only one -- we're all re-inventing the wheel 
here.

By the way -- what does the fromstring/fromfile code do when you don't 
give it a size to start with?


-Chris








-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Charles R Harris
On Thu, Nov 12, 2009 at 10:01 AM, Christopher Barker
chris.bar...@noaa.govwrote:

 Peter Schmidtke wrote:
  On Tue, 10 Nov 2009 16:07:32 -0800, Christopher Barker
  chris.bar...@noaa.gov wrote:

  I have a bunch of points in 2-d space, and I need to find out which
  pairs of points are within a certain distance of one-another (regular
  old Euclidean norm).
 
  How big is your set of points?

 Could be 100s of thousands, maybe ever millions. That's why O(N^2) is
 not good.


Another question is density. Are the points randomly scattered or do they
cluster? The order N^2 part can be taken care of if you can easily eliminate
a large fraction of the points, say by gridding the 2D part.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
Lou Pecora wrote:
 Maybe I'm missing something simple, but if your array of 2D points is
 static,

well, not quite.

 a KD tree for 2D nearest neighbor seems like over kill.  You
 might want to try the simple approach of using boxes of points to
 narrow things down by sorting on the first component.

yeah, we'll probably do something like that if we have to write the code 
ourselves. At the moment, we're using geohash:

http://pypi.python.org/pypi/Geohash/

(this is for points on the earth)

and it's working OK. I was just hoping kdtree would work out of the box!

  where for a
 static data set it can match KD trees in speed

Why does it have to be static -- it doesn't look hard to insert/remove 
points.

-Chris





-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Charles R Harris
On Thu, Nov 12, 2009 at 10:32 AM, Christopher Barker
chris.bar...@noaa.govwrote:

 David Cournapeau wrote:

  I would love having a core C library of containers -

 I'm all for that. However, I think that a very, very common need is
 simply for a growable numpy array.

 It seems this would actually be pretty darn easy (again, for someone
 familiar with the code!).

 IIUC, it would be exactly like a regular old ndarray, except:

  1) The data block could be larger than the current size of the array:
 - this would require an additional attribute to carry that size

  2) There would be an append method, or some other name:
 - this would increase the size of the array, manipulating
 .dimensions, and, if needed, doing a reallocation of the data pointer.
 - This is not very complex code, really, you simply allocate more
 than required, choosing some (user-resettable?) scale: 25% more than
 before, or whatever. re-allocating memory is actually pretty cheap these
 days, so it's not all that critical how you do it.

  3) we'd probably want a .fit() method that would resize the data block
 to fit the current size of the array.

 In fact, as I write this, I wonder if these features could simply be
 added to the regular old ndarray -- if not used, they wouldn't be much
 overhead.


I'm rapidly losing interest here. Perhaps you could supply some code
implementing this new array? Why not just a class using an array that
doubles the array size when an index is out of bounds and copies over the
old data. That is pretty much what realloc does. As to python lists, do you
have any benchmarks showing how bad python lists are compared to arrays?

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Robert Kern
On Thu, Nov 12, 2009 at 11:39, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Thu, Nov 12, 2009 at 10:32 AM, Christopher Barker chris.bar...@noaa.gov
 wrote:

 David Cournapeau wrote:

  I would love having a core C library of containers -

 I'm all for that. However, I think that a very, very common need is
 simply for a growable numpy array.

 It seems this would actually be pretty darn easy (again, for someone
 familiar with the code!).

 IIUC, it would be exactly like a regular old ndarray, except:

  1) The data block could be larger than the current size of the array:
     - this would require an additional attribute to carry that size

  2) There would be an append method, or some other name:
     - this would increase the size of the array, manipulating
 .dimensions, and, if needed, doing a reallocation of the data pointer.
     - This is not very complex code, really, you simply allocate more
 than required, choosing some (user-resettable?) scale: 25% more than
 before, or whatever. re-allocating memory is actually pretty cheap these
 days, so it's not all that critical how you do it.

  3) we'd probably want a .fit() method that would resize the data block
 to fit the current size of the array.

 In fact, as I write this, I wonder if these features could simply be
 added to the regular old ndarray -- if not used, they wouldn't be much
 overhead.


 I'm rapidly losing interest here. Perhaps you could supply some code
 implementing this new array? Why not just a class using an array that
 doubles the array size when an index is out of bounds and copies over the
 old data. That is pretty much what realloc does. As to python lists, do you
 have any benchmarks showing how bad python lists are compared to arrays?

Didn't we already do this?

http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Lou Pecora


- Original Message 
From: Christopher Barker chris.bar...@noaa.gov
To: Discussion of Numerical Python numpy-discussion@scipy.org
Sent: Thu, November 12, 2009 12:37:37 PM
Subject: Re: [Numpy-discussion] finding close together points.

Lou Pecora wrote:
 a KD tree for 2D nearest neighbor seems like over kill.  You
 might want to try the simple approach of using boxes of points to
 narrow things down by sorting on the first component.

yeah, we'll probably do something like that if we have to write the code 
ourselves. At the moment, we're using geohash:

http://pypi.python.org/pypi/Geohash/

(this is for points on the earth)

and it's working OK. I was just hoping kdtree would work out of the box!

 where for a
 static data set it can match KD trees in speed

Why does it have to be static -- it doesn't look hard to insert/remove 
points.

--- Lou answers --


It doesn't have to be static, but if I remember correctly, when you add points 
you have to resort or do a smart insert (which may require a lot of data 
shifting).  Not hard, but the original paper claimed that if there are a lot 
of changes to the data set this becomes more expensive than the kd-tree.  If 
you can get the original paper I mentioned, it will have more info.  It's 
pretty clearly written and contains some nice comparisons to kd-trees for 
finding near neighbors.  

Bottom line is that you can still try it with changing data.  See what the run 
times are like.  I've used the approach for up to eight dimensions with about 
10^5 data points.  It worked nicely.  I remember looking for a kd-tree library 
that would work out of the box (C or C++), but everything I found was not as 
simple as I would have liked.  The slice searching was a nice solution.  But 
maybe I just didn't know where to look or I'm not understanding some of the 
libraries I did find.

Let us know how it goes.

-- Lou Pecora


  
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Anne Archibald
2009/11/12 Lou Pecora lou_boog2...@yahoo.com:


 - Original Message 
 From: Christopher Barker chris.bar...@noaa.gov
 To: Discussion of Numerical Python numpy-discussion@scipy.org
 Sent: Thu, November 12, 2009 12:37:37 PM
 Subject: Re: [Numpy-discussion] finding close together points.

 Lou Pecora wrote:
 a KD tree for 2D nearest neighbor seems like over kill.  You
 might want to try the simple approach of using boxes of points to
 narrow things down by sorting on the first component.

 yeah, we'll probably do something like that if we have to write the code
 ourselves. At the moment, we're using geohash:

 http://pypi.python.org/pypi/Geohash/

 (this is for points on the earth)

 and it's working OK. I was just hoping kdtree would work out of the box!

 where for a
 static data set it can match KD trees in speed

 Why does it have to be static -- it doesn't look hard to insert/remove
 points.

 --- Lou answers --


 It doesn't have to be static, but if I remember correctly, when you add 
 points you have to resort or do a smart insert (which may require a lot of 
 data shifting).  Not hard, but the original paper claimed that if there are 
 a lot of changes to the data set this becomes more expensive than the 
 kd-tree.  If you can get the original paper I mentioned, it will have more 
 info.  It's pretty clearly written and contains some nice comparisons to 
 kd-trees for finding near neighbors.

 Bottom line is that you can still try it with changing data.  See what the 
 run times are like.  I've used the approach for up to eight dimensions with 
 about 10^5 data points.  It worked nicely.  I remember looking for a kd-tree 
 library that would work out of the box (C or C++), but everything I found was 
 not as simple as I would have liked.  The slice searching was a nice 
 solution.  But maybe I just didn't know where to look or I'm not 
 understanding some of the libraries I did find.

 Let us know how it goes.

Just a quick comment: none of the KDTrees in scipy.spatial support any
kind of modification right now, so if your data set changes you will
need to rebuild them completely. Construction is fairly cheap, but
it's very unlikely to compete with a data structure that supports
modification algorithms.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
Robert Kern wrote:

 Didn't we already do this?
 
 http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

Indeed we did. What I posted then ( and have improved a bit now). Is a 
Python version. Written in Python, it has an advantage of using less 
memory for a big array, but is slower in other respects than a Python 
list. This is probably why we all use lists for this when we need it!

What I'd like (and I have seen interest from others) is a C version, one 
that could be used from C code directly as well. I can't benchmark that, 
as it hasn't been written!

I'm afraid I'm out of my depth as to how to write such a thing in C (at 
least so that it is well integrated into numpy). I suppose I could write 
a simple C accumulating array, and only convert it into a numpy array at 
the end (indeed, I have done that in the past), and benchmark that.


Charles R Harris wrote:

 I'm rapidly losing interest here.

Maybe that's why it hasn't been done yet -- it's not an interesting 
enough problem!

  Why not just a class using an array that 
 doubles the array size when an index is out of bounds and copies over 
 the old data.

Yes, it pretty much is that simple -- that's my point. Simple and 
useful, and why should we all re-invent this wheel each time (even it if 
is a simple wheel)? At the C(ython) level, it's not that hard to simply 
do your accumulating in regular old C, and then convert to a numpy 
array, but it would be nice for it to be easier, particularly for 
various data-neutral code, particularly numpy custom-defined ones.

-Chris















-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Charles R Harris
On Thu, Nov 12, 2009 at 12:52 PM, Christopher Barker
chris.bar...@noaa.govwrote:

 Robert Kern wrote:

  Didn't we already do this?
 
  http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

 Indeed we did. What I posted then ( and have improved a bit now). Is a
 Python version. Written in Python, it has an advantage of using less
 memory for a big array, but is slower in other respects than a Python
 list. This is probably why we all use lists for this when we need it!


IIRC, python lists *are* expanding arrays of contiguous memory, not linked
lists. They are efficient for appending and indexing, less so for deletion
and insertion. It's one of those tradeoffs. So what you buy with an array
implementation is the space/time efficiency of not having to allocate python
types to put on the list. But you probably need to go through a python type
at some point anyway, so...

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Charles R Harris
On Thu, Nov 12, 2009 at 3:30 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:



 On Thu, Nov 12, 2009 at 12:52 PM, Christopher Barker 
 chris.bar...@noaa.gov wrote:

 Robert Kern wrote:

  Didn't we already do this?
 
  http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

 Indeed we did. What I posted then ( and have improved a bit now). Is a
 Python version. Written in Python, it has an advantage of using less
 memory for a big array, but is slower in other respects than a Python
 list. This is probably why we all use lists for this when we need it!


 IIRC, python lists *are* expanding arrays of contiguous memory, not linked
 lists. They are efficient for appending and indexing, less so for deletion
 and insertion. It's one of those tradeoffs. So what you buy with an array
 implementation is the space/time efficiency of not having to allocate python
 types to put on the list. But you probably need to go through a python type
 at some point anyway, so...


And here is a pep http://www.python.org/dev/peps/pep-3128/%20 for a Blist
with a table of some of the performance tradeoffs. I think you could use
small ndarrays in the leaf nodes and get around having to handle all the
type stuff.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
Charles R Harris wrote:
 So what you buy 
 with an array implementation is the space/time efficiency of not having 
 to allocate python types to put on the list. But you probably need to go 
 through a python type at some point anyway,

When writing Python, yes (though maybe not if you are appending a lot of 
data at once from a numpy array). That's whey my python implementation 
is slower, and I suspect a C implementation would be similar in 
performance to a list.

But if you are writing a C extension in C or Cython, then you could get 
a real gain by not having to go through Python types.

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Christopher Barker
Charles R Harris wrote:
 And here is a pep http://www.python.org/dev/peps/pep-3128/%20

That link was broken, try this one:

http://www.python.org/dev/peps/pep-3128/

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-12 Thread Anne Archibald
2009/11/11 Christopher Barker chris.bar...@noaa.gov:
 Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is an eminently reasonable thing to want, and KDTree should
 support it. Unfortunately it doesn't.

 darn.

 It's slow both because you're using the python implementation rather
 than the C,

 I noticed that.

 The one good thing about using the python implementation rather than
 the Cython one is that you can subclass it, providing a new method.
 There's still a certain amount of boilerplate code to write, but it
 shouldn't be too bad.

 Can you give me a hint as to where to start?  I have no idea how kd
 trees work.

 If this is still too slow, I have no problem incorporating additional
 code into cKDTree; the only reason none of the ball queries are in
 there is because ball queries must return variable-size results, so
 you lose a certain amount of speed because you're forced to manipulate
 python objects. But if there are relatively few result points, this
 need not be much of a slowdown.

 And certainly in this case, there would not be very many results, and
 lists are pretty fast.

This is now implemented in SVN. I (tentatively?) used a set to store
the collection of pairs, because my tree traversal is not smart enough
to reliably uniquify the pairs without using sets. With more time and
energy, I'm sure the algorithm could be improved to avoid using sets
(both internally and on return), but I think that's something to save
for the Cython version.

Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-11 Thread Lou Pecora
- Original Message 
From: Christopher Barker chris.bar...@noaa.gov
To: Discussion of Numerical Python numpy-discussion@scipy.org
Sent: Tue, November 10, 2009 7:07:32 PM
Subject: [Numpy-discussion] finding close together points.

Hi all,

I have a bunch of points in 2-d space, and I need to find out which 
pairs of points are within a certain distance of one-another (regular 
old Euclidean norm).

scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.




Chris,

Maybe I'm missing something simple, but if your array of 2D points is static, a 
KD tree for 2D nearest neighbor seems like over kill.  You might want to try 
the simple approach of using boxes of points to narrow things down by sorting 
on the first component.  If your distances are, say, 10% of the variance, then 
you'll *roughly* decrease the remaining points to search by a factor of 10.  
This can get more sophisticated and is useful in higher dimensions (see:  
Sameer A. Nene and Shree K. Nayar, A Simple Algorithm for Nearest Neighbor 
Search in High Dimensions, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE 
INTELLIGENCE 19 (9), 989 (1997).) where for a static data set it can match KD 
trees in speed, but is SO much easier to program.  
 -- Lou Pecora, my views are my own.


  
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-11 Thread Charles R Harris
On Tue, Nov 10, 2009 at 11:15 PM, David Cournapeau 
da...@ar.media.kyoto-u.ac.jp wrote:

 Charles R Harris wrote:
 
  I think Python lists are basically just expanding arrays and pointers
  are cheap. Where you might lose is in creating python objects to put
  in the list and not having ufuncs and the rest of the numpy machinery.
  If you don't need the machinery, lists are probably not a bad way to go.

 I am not familiar enough with the kdtree code: if you need to manipulate
 core datatypes (float, etc...), I think python lists are a significant
 cost, because of pointer chasing in particular. It may or may not apply,
 but Stefan used some tricks to avoid using python list and use basic C
 structures in some custom code (written in cython), maybe he has
 something to say.

 I would love having a core C library of containers - there are a few
 candidates we could take code from:


I've got a cythonized structure for union-find (relations) and we could put
together a heap pretty easily from heap sort. I've also got one for the
problem at hand, but it is for several thousand points (stars) scattered
over an image and uses c++ lists.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Anne Archibald
2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 Hi all,

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

This is an eminently reasonable thing to want, and KDTree should
support it. Unfortunately it doesn't.

 scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.

 However, I'm a bit confused. The first argument is a kdtree, but I'm
 calling it as a method of a kdtree -- I want to know which points in the
 tree I already have are closer that some r from each-other.

 If I call it as:

 tree.query_ball_tree(tree, r)

 I get a big list, that has all the points in it (some of them paired up
 with close neighbors.) It appears I'm getting the distances between all
 the points in the tree and itself, as though they were different trees.

 This is slow, takes a bunch of memory, and I then have to parse out the
 list to find the ones that are paired up.

 Is there a way to get just the close ones from the single tree?

Unfortunately not at the moment.

It's slow both because you're using the python implementation rather
than the C, and because you're getting all pairs where pair
includes pairing a point with itself (and also each distinct pair in
both orders). The tree really should allow self-queries that don't
return the point and itself.

The one good thing about using the python implementation rather than
the Cython one is that you can subclass it, providing a new method.
There's still a certain amount of boilerplate code to write, but it
shouldn't be too bad.

If this is still too slow, I have no problem incorporating additional
code into cKDTree; the only reason none of the ball queries are in
there is because ball queries must return variable-size results, so
you lose a certain amount of speed because you're forced to manipulate
python objects. But if there are relatively few result points, this
need not be much of a slowdown.

Anne

 thanks,

 -Chris






 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread James Bergstra
On Tue, Nov 10, 2009 at 7:07 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 Hi all,

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.


In some cases a brute-force approach is also good.
If r is a matrix of shape Nx2:

(r*r).sum(axis=1) -2 * numpy.dot(r, r.T) +
(r*r).sum(axis=1).reshape((r.shape[0], 1))  thresh**2

It's brute force, but it takes advantage of fast matrix multiplication.

-- 
http://www-etud.iro.umontreal.ca/~bergstrj
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Christopher Barker
James Bergstra wrote:
 In some cases a brute-force approach is also good.

true.

 If r is a matrix of shape Nx2:
 
 (r*r).sum(axis=1) -2 * numpy.dot(r, r.T) +
 (r*r).sum(axis=1).reshape((r.shape[0], 1))  thresh**2
 
 It's brute force, but it takes advantage of fast matrix multiplication.

I'm more concerned about memory -- doesn't this use N^2 memory? Which 
could be an issue here.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread James Bergstra
On Tue, Nov 10, 2009 at 8:17 PM, Christopher Barker
chris.bar...@noaa.gov wrote:
 James Bergstra wrote:
 In some cases a brute-force approach is also good.

 true.

 If r is a matrix of shape Nx2:

 (r*r).sum(axis=1) -2 * numpy.dot(r, r.T) +
 (r*r).sum(axis=1).reshape((r.shape[0], 1))  thresh**2

 It's brute force, but it takes advantage of fast matrix multiplication.

 I'm more concerned about memory -- doesn't this use N^2 memory? Which
 could be an issue here.

Yes, this uses N^2 time and space.  It's not a good algorithm when N is large.

-- 
http://www-etud.iro.umontreal.ca/~bergstrj
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread David Goldsmith
Also, is it not returning distances between points and themselves?  Or am I
misinterpreting it?

DG

On Tue, Nov 10, 2009 at 5:17 PM, Christopher Barker
chris.bar...@noaa.govwrote:

 James Bergstra wrote:
  In some cases a brute-force approach is also good.

 true.

  If r is a matrix of shape Nx2:
 
  (r*r).sum(axis=1) -2 * numpy.dot(r, r.T) +
  (r*r).sum(axis=1).reshape((r.shape[0], 1))  thresh**2
 
  It's brute force, but it takes advantage of fast matrix multiplication.

 I'm more concerned about memory -- doesn't this use N^2 memory? Which
 could be an issue here.

 -Chris


 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR(206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115   (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread josef . pktd
On Tue, Nov 10, 2009 at 7:48 PM, Anne Archibald
peridot.face...@gmail.com wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:
 Hi all,

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).

 This is an eminently reasonable thing to want, and KDTree should
 support it. Unfortunately it doesn't.

 scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.

 However, I'm a bit confused. The first argument is a kdtree, but I'm
 calling it as a method of a kdtree -- I want to know which points in the
 tree I already have are closer that some r from each-other.

 If I call it as:

 tree.query_ball_tree(tree, r)

 I get a big list, that has all the points in it (some of them paired up
 with close neighbors.) It appears I'm getting the distances between all
 the points in the tree and itself, as though they were different trees.

 This is slow, takes a bunch of memory, and I then have to parse out the
 list to find the ones that are paired up.

 Is there a way to get just the close ones from the single tree?

I used sparse_distance_matrix for the distance of a kdtree to itself
in the past.

Since it's a matrix it should be possible to just get the lower or
upper triangle, and it's sparse so memory is not so much of a problem.
But I remember it was also slow and only worth using if the matrix is
large.

Josef


 Unfortunately not at the moment.

 It's slow both because you're using the python implementation rather
 than the C, and because you're getting all pairs where pair
 includes pairing a point with itself (and also each distinct pair in
 both orders). The tree really should allow self-queries that don't
 return the point and itself.

 The one good thing about using the python implementation rather than
 the Cython one is that you can subclass it, providing a new method.
 There's still a certain amount of boilerplate code to write, but it
 shouldn't be too bad.

 If this is still too slow, I have no problem incorporating additional
 code into cKDTree; the only reason none of the ball queries are in
 there is because ball queries must return variable-size results, so
 you lose a certain amount of speed because you're forced to manipulate
 python objects. But if there are relatively few result points, this
 need not be much of a slowdown.

 Anne

 thanks,

 -Chris






 --
 Christopher Barker, Ph.D.
 Oceanographer

 Emergency Response Division
 NOAA/NOS/ORR            (206) 526-6959   voice
 7600 Sand Point Way NE   (206) 526-6329   fax
 Seattle, WA  98115       (206) 526-6317   main reception

 chris.bar...@noaa.gov
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Christopher Barker
Anne Archibald wrote:
 2009/11/10 Christopher Barker chris.bar...@noaa.gov:

 I have a bunch of points in 2-d space, and I need to find out which
 pairs of points are within a certain distance of one-another (regular
 old Euclidean norm).
 
 This is an eminently reasonable thing to want, and KDTree should
 support it. Unfortunately it doesn't.

darn.

 It's slow both because you're using the python implementation rather
 than the C,

I noticed that.

 The one good thing about using the python implementation rather than
 the Cython one is that you can subclass it, providing a new method.
 There's still a certain amount of boilerplate code to write, but it
 shouldn't be too bad.

Can you give me a hint as to where to start?  I have no idea how kd 
trees work.

 If this is still too slow, I have no problem incorporating additional
 code into cKDTree; the only reason none of the ball queries are in
 there is because ball queries must return variable-size results, so
 you lose a certain amount of speed because you're forced to manipulate
 python objects. But if there are relatively few result points, this
 need not be much of a slowdown.

And certainly in this case, there would not be very many results, and 
lists are pretty fast.

However, this does bring up the need for a growable numpy array. I've 
been working on one in Python, but it would be really nice to have one 
that could be accessed by C (or cython) code.

It wouldn't be that hard to do (for someone familiar with the numpy 
code, anyway), I don't think. It is mostly boiler plate in Python. I'm 
imagining it would only support contiguous arrays, and could only grow 
is the first dimension.




-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Charles R Harris
On Tue, Nov 10, 2009 at 10:51 PM, Christopher Barker
chris.bar...@noaa.govwrote:

 Anne Archibald wrote:
  2009/11/10 Christopher Barker chris.bar...@noaa.gov:

  I have a bunch of points in 2-d space, and I need to find out which
  pairs of points are within a certain distance of one-another (regular
  old Euclidean norm).
 
  This is an eminently reasonable thing to want, and KDTree should
  support it. Unfortunately it doesn't.

 darn.

  It's slow both because you're using the python implementation rather
  than the C,

 I noticed that.

  The one good thing about using the python implementation rather than
  the Cython one is that you can subclass it, providing a new method.
  There's still a certain amount of boilerplate code to write, but it
  shouldn't be too bad.

 Can you give me a hint as to where to start?  I have no idea how kd
 trees work.

  If this is still too slow, I have no problem incorporating additional
  code into cKDTree; the only reason none of the ball queries are in
  there is because ball queries must return variable-size results, so
  you lose a certain amount of speed because you're forced to manipulate
  python objects. But if there are relatively few result points, this
  need not be much of a slowdown.

 And certainly in this case, there would not be very many results, and
 lists are pretty fast.

 However, this does bring up the need for a growable numpy array. I've
 been working on one in Python, but it would be really nice to have one
 that could be accessed by C (or cython) code.

 It wouldn't be that hard to do (for someone familiar with the numpy
 code, anyway), I don't think. It is mostly boiler plate in Python. I'm
 imagining it would only support contiguous arrays, and could only grow
 is the first dimension.



I think Python lists are basically just expanding arrays and pointers are
cheap. Where you might lose is in creating python objects to put in the list
and not having ufuncs and the rest of the numpy machinery. If you don't need
the machinery, lists are probably not a bad way to go.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread David Cournapeau
Charles R Harris wrote:

 I think Python lists are basically just expanding arrays and pointers
 are cheap. Where you might lose is in creating python objects to put
 in the list and not having ufuncs and the rest of the numpy machinery.
 If you don't need the machinery, lists are probably not a bad way to go.

I am not familiar enough with the kdtree code: if you need to manipulate
core datatypes (float, etc...), I think python lists are a significant
cost, because of pointer chasing in particular. It may or may not apply,
but Stefan used some tricks to avoid using python list and use basic C
structures in some custom code (written in cython), maybe he has
something to say.

I would love having a core C library of containers - there are a few
candidates we could take code from:
- http://github.com/stevedekorte/basekit: used in IO programming
language. Quite a few goodies, from list to trees to sets to portable
code to load dynamically loaded libraries. BSD
- The Apple Core Foundation (I have to check the Apple license is
ok). It is more complex, but is designed with objective-C in mind,
meaning integration with the C python API may be easier (both Obj-C and
Python C API being based on ref counting).
- Another one mentioned by Robert Kern, I forgot the reference.

I hope to make it my main focus for the 1.5 release,

cheers,

David
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Robert Kern
On Wed, Nov 11, 2009 at 01:15, David Cournapeau
da...@ar.media.kyoto-u.ac.jp wrote:

    - The Apple Core Foundation (I have to check the Apple license is
 ok).

No. The APSL is not DFSG-free.

 It is more complex, but is designed with objective-C in mind,
 meaning integration with the C python API may be easier (both Obj-C and
 Python C API being based on ref counting).
    - Another one mentioned by Robert Kern, I forgot the reference.

http://c-algorithms.sourceforge.net/

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread David Cournapeau
Robert Kern wrote:

 No. The APSL is not DFSG-free.
   

It was too good to be true, I guess.


 http://c-algorithms.sourceforge.net/
   

Thanks for the link,

David

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] finding close together points.

2009-11-10 Thread Peter Schmidtke
On Tue, 10 Nov 2009 16:07:32 -0800, Christopher Barker
chris.bar...@noaa.gov wrote:
 Hi all,
 
 I have a bunch of points in 2-d space, and I need to find out which 
 pairs of points are within a certain distance of one-another (regular 
 old Euclidean norm).

How big is your set of points?

 
 scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.
 
 However, I'm a bit confused. The first argument is a kdtree, but I'm 
 calling it as a method of a kdtree -- I want to know which points in the 
 tree I already have are closer that some r from each-other.
 
 If I call it as:
 
 tree.query_ball_tree(tree, r)
 
 I get a big list, that has all the points in it (some of them paired up 
 with close neighbors.) It appears I'm getting the distances between all 
 the points in the tree and itself, as though they were different trees.
 
 This is slow, takes a bunch of memory, and I then have to parse out the 
 list to find the ones that are paired up.
 
 Is there a way to get just the close ones from the single tree?
 
 thanks,
 
 -Chris

-- 

Peter Schmidtke

--
PhD Student at the Molecular Modeling and Bioinformatics Group
Dep. Physical Chemistry
Faculty of Pharmacy
University of Barcelona

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion