Re: [Numpy-discussion] finding close together points.
2009/11/16 Christopher Barker <chris.bar...@noaa.gov>:
> Anne Archibald wrote:
>> 2009/11/13 Christopher Barker <chris.bar...@noaa.gov>:
>>> Wow! great -- you sounded interested, but I had no idea you'd run out
>>> and do it! thanks! we'll check it out.
>
> Well, it turns out the Python version is unacceptably slow. However, we
> found we could use cKDTree, and use:
>
>     tree.query(points,
>                2,  # number of points to return per point given
>                distance_upper_bound=distance_tolerance)
>
> This gives us a set of pairs of points closer than our distance
> tolerance -- it includes duplicates, of course, but even when we filter
> those in pure Python, it's nice and speedy. It looks like it would be
> pretty easy to add this to the Cython code, but it's working now for
> us, so...

You probably noticed this, but I should remind you that if a point has more than one nearby neighbor, this may not give you what you wanted. In particular, if there are three points all within the fixed distance of each other, you may not find all three pairs this way.

Anne

> Thanks again,
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/ORR            (206) 526-6959  voice
> 7600 Sand Point Way NE  (206) 526-6329  fax
> Seattle, WA 98115       (206) 526-6317  main reception
>
> chris.bar...@noaa.gov

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
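[Editorial sketch] A runnable version of the recipe Chris describes above. The data and tolerance here are made-up stand-ins, and the pure-Python filtering is one possible way to drop the duplicates; note Anne's caveat in this message that with only one non-self neighbor returned per point, three mutually close points may not yield all three pairs.

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical stand-ins for the real data and tolerance
points = np.array([[0.0, 0.0], [0.0, 0.005],
                   [1.0, 1.0], [1.0, 1.004]])
distance_tolerance = 0.01

tree = cKDTree(points)
# For each point, ask for 2 neighbors (itself plus the nearest other one),
# discarding anything beyond the tolerance.
dist, idx = tree.query(points, 2, distance_upper_bound=distance_tolerance)

# Filter in pure Python: missing neighbors come back with infinite distance,
# and sorting each pair removes the (i, j)/(j, i) duplicates.
pairs = set()
for i, (d, j) in enumerate(zip(dist[:, 1], idx[:, 1])):
    if np.isfinite(d):
        pairs.add((min(i, int(j)), max(i, int(j))))
```

With the toy data above, `pairs` contains the two close pairs, each listed once.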
Re: [Numpy-discussion] finding close together points.
Anne Archibald wrote:
> 2009/11/10 Christopher Barker <chris.bar...@noaa.gov>:
>> I have a bunch of points in 2-d space, and I need to find out which
>> pairs of points are within a certain distance of one another (regular
>> old Euclidean norm).
>
> This is now implemented in SVN.

Wow! great -- you sounded interested, but I had no idea you'd run out and do it! thanks! we'll check it out.

> I (tentatively?) used a set to store the collection of pairs, because
> my tree traversal is not smart enough to reliably uniquify the pairs
> without using sets. With more time and energy, I'm sure the algorithm
> could be improved to avoid using sets (both internally and on return),
> but I think that's something to save for the Cython version.

I agree -- what's wrong with using a set?

Thanks, we'll let you know how it works for us.

-Chris
Re: [Numpy-discussion] finding close together points.
2009/11/13 Christopher Barker <chris.bar...@noaa.gov>:
> Anne Archibald wrote:
>> This is now implemented in SVN.
>
> Wow! great -- you sounded interested, but I had no idea you'd run out
> and do it! thanks! we'll check it out.

No problem, it was a fairly minor modification of the two-tree code.

>> I (tentatively?) used a set to store the collection of pairs, because
>> my tree traversal is not smart enough to reliably uniquify the pairs
>> without using sets. With more time and energy, I'm sure the algorithm
>> could be improved to avoid using sets (both internally and on return),
>> but I think that's something to save for the Cython version.
>
> I agree -- what's wrong with using a set?

It should be possible to arrange the tree traversal algorithm so that each pair of subtrees is automatically traversed only once, akin to:

    for i in range(n):
        for j in range(i):
            ...

This would avoid having to build and modify a set every time you traverse the tree (sets are fairly cheap hash tables, so the cost is probably minor compared to other Python overhead), and it would also naturally allow the result to be a list/inflatable array (which would save memory, if nothing else). But it would also require carefully thinking through the tree traversal algorithm.

> Thanks, we'll let you know how it works for us.

Good luck!

Anne
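[Editorial sketch] The same `j < i` structure Anne mentions is what makes a brute-force pair search emit each unordered pair exactly once, with no set needed. A toy illustration (not the tree algorithm itself; the function name is invented):

```python
import numpy as np

def close_pairs(points, d):
    """Return each close pair exactly once, as (j, i) with j < i."""
    points = np.asarray(points, dtype=float)
    pairs = []
    n = len(points)
    for i in range(n):
        for j in range(i):  # j < i: each unordered pair is visited once
            dx, dy = points[i] - points[j]
            if np.hypot(dx, dy) <= d:
                pairs.append((j, i))
    return pairs
```

Because every pair is visited in only one order, the result is naturally duplicate-free; a smarter dual-tree traversal would aim for the same invariant.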
Re: [Numpy-discussion] finding close together points.
Peter Schmidtke wrote:
> On Tue, 10 Nov 2009 16:07:32 -0800, Christopher Barker wrote:
>> I have a bunch of points in 2-d space, and I need to find out which
>> pairs of points are within a certain distance of one another (regular
>> old Euclidean norm).
>
> How big is your set of points?

Could be 100s of thousands, maybe even millions. That's why O(N^2) is not good.

-Chris
Re: [Numpy-discussion] finding close together points.
David Cournapeau wrote:
> I would love having a core C library of containers -

I'm all for that. However, I think that a very, very common need is simply for a growable numpy array. It seems this would actually be pretty darn easy (again, for someone familiar with the code!). IIUC, it would be exactly like a regular old ndarray, except:

1) The data block could be larger than the current size of the array:
   - this would require an additional attribute to carry that size.

2) There would be an append method (or some other name):
   - this would increase the size of the array, manipulating .dimensions and, if needed, re-allocating the data pointer.
   - This is not very complex code, really: you simply allocate more than required, choosing some (user-resettable?) scale -- 25% more than before, or whatever. Re-allocating memory is actually pretty cheap these days, so it's not all that critical how you do it.

3) We'd probably want a .fit() method that would resize the data block to fit the current size of the array.

In fact, as I write this, I wonder if these features could simply be added to the regular old ndarray -- if not used, they wouldn't be much overhead.

Issues:

- Views on the same data block: as the data pointer would need to be re-allocated, you couldn't have views on the data block. Maybe this could be handled the same way ndarray.resize is handled now: you'd get an error if you try to resize an array that has a view into its data block. This is a bit tricky, as it's really easy to create a view without thinking about it (by using ipython, for instance).

- You could only grow the array in the first dimension (C order, anyway). I think it could still be very useful, handling many of the common cases.

- If you append to an array of more than one dimension, do you need to give the entire appended slice? Again, restrictive, but still useful.

- Others? -- I'm not thinking of any right now.
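[Editorial sketch] The proposal above can be mocked up in pure Python to show the intended semantics. This is an illustration, not the C implementation being asked for; the class name, the initial block size, and the default growth factor are invented (the 25% figure follows the example in the message).

```python
import numpy as np

class GrowableArray:
    """1-d growable array: over-allocates so appends are amortized O(1)."""

    def __init__(self, dtype=float, scale=1.25):
        self._data = np.empty(4, dtype=dtype)  # allocated block
        self._n = 0                            # current logical size
        self._scale = scale                    # over-allocation factor

    def append(self, value):
        if self._n >= len(self._data):
            # grow the block by the over-allocation factor (realloc-style)
            new_size = int(len(self._data) * self._scale) + 1
            new_block = np.empty(new_size, dtype=self._data.dtype)
            new_block[:self._n] = self._data[:self._n]
            self._data = new_block
        self._data[self._n] = value
        self._n += 1

    def fit(self):
        """Shrink the allocated block to the current size and return it."""
        self._data = self._data[:self._n].copy()
        return self._data

ga = GrowableArray()
for x in range(10):
    ga.append(x)
arr = ga.fit()
```

A C version would do the same with a raw realloc on the data pointer, which is where the view-invalidation issue in the message comes from.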
David Cournapeau wrote:
> If you need to manipulate core datatypes (float, etc...), I think
> python lists are a significant cost, because of pointer chasing in
> particular.

Exactly, and add custom numpy dtypes as well. Putting python types and/or tuples of python types into a list can be far more memory intensive than putting them in a numpy array.

That's why I built a python implementation of a growable array -- it does save memory, but it's a fair bit slower than lists: when adding to it in python, you've got python types already, so the additional overhead of an extra function call, etc. is substantial. That's why I'd like a C version. Even more important is having one that is not only written in C (or Cython) but can be accessed directly from C (or Cython), so you can fill an array with core data types efficiently. Going from C type to python object to put it in a list is substantial overhead.

> Stefan used some tricks to avoid using python list and use basic C
> structures in some custom code.

Exactly -- and he's not the only one -- we're all re-inventing the wheel here.

By the way -- what does the fromstring/fromfile code do when you don't give it a size to start with?

-Chris
Re: [Numpy-discussion] finding close together points.
On Thu, Nov 12, 2009 at 10:01 AM, Christopher Barker wrote:
> Peter Schmidtke wrote:
>> How big is your set of points?
>
> Could be 100s of thousands, maybe even millions. That's why O(N^2) is
> not good.

Another question is density. Are the points randomly scattered or do they cluster? The order N^2 part can be taken care of if you can easily eliminate a large fraction of the points, say by gridding the 2D part.

Chuck
Re: [Numpy-discussion] finding close together points.
Lou Pecora wrote:
> Maybe I'm missing something simple, but if your array of 2D points is
> static,

well, not quite.

> a KD tree for 2D nearest neighbor seems like overkill. You might want
> to try the simple approach of using boxes of points to narrow things
> down by sorting on the first component.

yeah, we'll probably do something like that if we have to write the code ourselves. At the moment, we're using geohash:

http://pypi.python.org/pypi/Geohash/

(this is for points on the earth) and it's working OK. I was just hoping kdtree would work out of the box!

> where for a static data set it can match KD trees in speed

Why does it have to be static -- it doesn't look hard to insert/remove points.

-Chris
Re: [Numpy-discussion] finding close together points.
On Thu, Nov 12, 2009 at 10:32 AM, Christopher Barker wrote:
> David Cournapeau wrote:
>> I would love having a core C library of containers -
>
> I'm all for that. However, I think that a very, very common need is
> simply for a growable numpy array. It seems this would actually be
> pretty darn easy (again, for someone familiar with the code!). [...]

I'm rapidly losing interest here. Perhaps you could supply some code implementing this new array? Why not just a class using an array that doubles the array size when an index is out of bounds and copies over the old data? That is pretty much what realloc does. As to python lists, do you have any benchmarks showing how bad python lists are compared to arrays?

Chuck
Re: [Numpy-discussion] finding close together points.
On Thu, Nov 12, 2009 at 11:39, Charles R Harris <charlesr.har...@gmail.com> wrote:
> On Thu, Nov 12, 2009 at 10:32 AM, Christopher Barker wrote:
>> [...]
>
> I'm rapidly losing interest here. Perhaps you could supply some code
> implementing this new array? [...] As to python lists, do you have any
> benchmarks showing how bad python lists are compared to arrays?

Didn't we already do this?

http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth.
  -- Umberto Eco
Re: [Numpy-discussion] finding close together points.
----- Original Message -----
From: Christopher Barker <chris.bar...@noaa.gov>
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Sent: Thu, November 12, 2009 12:37:37 PM
Subject: Re: [Numpy-discussion] finding close together points.

> Lou Pecora wrote:
>> a KD tree for 2D nearest neighbor seems like overkill. You might want
>> to try the simple approach of using boxes of points to narrow things
>> down by sorting on the first component.
>
> yeah, we'll probably do something like that if we have to write the
> code ourselves. [...] I was just hoping kdtree would work out of the
> box!
>
>> where for a static data set it can match KD trees in speed
>
> Why does it have to be static -- it doesn't look hard to insert/remove
> points.

--- Lou answers ---

It doesn't have to be static, but if I remember correctly, when you add points you have to re-sort or do a smart insert (which may require a lot of data shifting). Not hard, but the original paper claimed that if there are a lot of changes to the data set this becomes more expensive than the kd-tree. If you can get the original paper I mentioned, it will have more info. It's pretty clearly written and contains some nice comparisons to kd-trees for finding near neighbors.

Bottom line is that you can still try it with changing data. See what the run times are like. I've used the approach for up to eight dimensions with about 10^5 data points. It worked nicely.

I remember looking for a kd-tree library that would work out of the box (C or C++), but everything I found was not as simple as I would have liked. The slice searching was a nice solution. But maybe I just didn't know where to look, or I'm not understanding some of the libraries I did find. Let us know how it goes.

-- Lou Pecora
Re: [Numpy-discussion] finding close together points.
2009/11/12 Lou Pecora <lou_boog2...@yahoo.com>:
> It doesn't have to be static, but if I remember correctly, when you add
> points you have to re-sort or do a smart insert (which may require a
> lot of data shifting). Not hard, but the original paper claimed that if
> there are a lot of changes to the data set this becomes more expensive
> than the kd-tree. [...] Bottom line is that you can still try it with
> changing data. See what the run times are like.

Just a quick comment: none of the KDTrees in scipy.spatial support any kind of modification right now, so if your data set changes you will need to rebuild them completely. Construction is fairly cheap, but it's very unlikely to compete with a data structure that supports modification algorithms.

Anne
Re: [Numpy-discussion] finding close together points.
Robert Kern wrote:
> Didn't we already do this?
> http://www.mail-archive.com/numpy-discussion@scipy.org/msg21010.html

Indeed we did. What I posted then (and have improved a bit since) is a Python version. Written in Python, it has the advantage of using less memory for a big array, but it is slower in other respects than a Python list. This is probably why we all use lists for this when we need it!

What I'd like (and I have seen interest from others) is a C version, one that could be used from C code directly as well. I can't benchmark that, as it hasn't been written! I'm afraid I'm out of my depth as to how to write such a thing in C (at least so that it is well integrated into numpy). I suppose I could write a simple C accumulating array, and only convert it into a numpy array at the end (indeed, I have done that in the past), and benchmark that.

Charles R Harris wrote:
> I'm rapidly losing interest here.

Maybe that's why it hasn't been done yet -- it's not an interesting enough problem!

> Why not just a class using an array that doubles the array size when
> an index is out of bounds and copies over the old data?

Yes, it pretty much is that simple -- that's my point. Simple and useful, so why should we all re-invent this wheel each time (even if it is a simple wheel)? At the C (or Cython) level, it's not that hard to simply do your accumulating in regular old C and then convert to a numpy array, but it would be nice for it to be easier, particularly for dtype-neutral code, including custom numpy dtypes.

-Chris
Re: [Numpy-discussion] finding close together points.
On Thu, Nov 12, 2009 at 12:52 PM, Christopher Barker wrote:
> Indeed we did. What I posted then (and have improved a bit since) is a
> Python version. Written in Python, it has the advantage of using less
> memory for a big array, but it is slower in other respects than a
> Python list. This is probably why we all use lists for this when we
> need it!

IIRC, python lists *are* expanding arrays of contiguous memory, not linked lists. They are efficient for appending and indexing, less so for deletion and insertion. It's one of those tradeoffs. So what you buy with an array implementation is the space/time efficiency of not having to allocate python types to put on the list. But you probably need to go through a python type at some point anyway, so...

Chuck
Re: [Numpy-discussion] finding close together points.
On Thu, Nov 12, 2009 at 3:30 PM, Charles R Harris <charlesr.har...@gmail.com> wrote:
> IIRC, python lists *are* expanding arrays of contiguous memory, not
> linked lists. They are efficient for appending and indexing, less so
> for deletion and insertion. [...] But you probably need to go through
> a python type at some point anyway, so...

And here is a pep http://www.python.org/dev/peps/pep-3128/%20 for a Blist with a table of some of the performance tradeoffs. I think you could use small ndarrays in the leaf nodes and get around having to handle all the type stuff.

Chuck
Re: [Numpy-discussion] finding close together points.
Charles R Harris wrote:
> So what you buy with an array implementation is the space/time
> efficiency of not having to allocate python types to put on the list.
> But you probably need to go through a python type at some point anyway,

When writing Python, yes (though maybe not if you are appending a lot of data at once from a numpy array). That's why my python implementation is slower, and I suspect a C implementation would be similar in performance to a list. But if you are writing an extension in C or Cython, then you could get a real gain by not having to go through Python types.

-Chris
Re: [Numpy-discussion] finding close together points.
Charles R Harris wrote:
> And here is a pep http://www.python.org/dev/peps/pep-3128/%20

That link was broken, try this one:

http://www.python.org/dev/peps/pep-3128/

-Chris
Re: [Numpy-discussion] finding close together points.
2009/11/11 Christopher Barker <chris.bar...@noaa.gov>:
> Anne Archibald wrote:
>> 2009/11/10 Christopher Barker <chris.bar...@noaa.gov>:
>>> I have a bunch of points in 2-d space, and I need to find out which
>>> pairs of points are within a certain distance of one another (regular
>>> old Euclidean norm).
>>
>> This is an eminently reasonable thing to want, and KDTree should
>> support it. Unfortunately it doesn't.
>
> darn.
>
>> It's slow both because you're using the python implementation rather
>> than the C,
>
> I noticed that.
>
>> The one good thing about using the python implementation rather than
>> the Cython one is that you can subclass it, providing a new method.
>> There's still a certain amount of boilerplate code to write, but it
>> shouldn't be too bad.
>
> Can you give me a hint as to where to start? I have no idea how kd
> trees work.
>
>> If this is still too slow, I have no problem incorporating additional
>> code into cKDTree; the only reason none of the ball queries are in
>> there is because ball queries must return variable-size results, so
>> you lose a certain amount of speed because you're forced to manipulate
>> python objects. But if there are relatively few result points, this
>> need not be much of a slowdown.
>
> And certainly in this case, there would not be very many results, and
> lists are pretty fast.

This is now implemented in SVN. I (tentatively?) used a set to store the collection of pairs, because my tree traversal is not smart enough to reliably uniquify the pairs without using sets. With more time and energy, I'm sure the algorithm could be improved to avoid using sets (both internally and on return), but I think that's something to save for the Cython version.

Anne
Re: [Numpy-discussion] finding close together points.
----- Original Message -----
From: Christopher Barker <chris.bar...@noaa.gov>
To: Discussion of Numerical Python <numpy-discussion@scipy.org>
Sent: Tue, November 10, 2009 7:07:32 PM
Subject: [Numpy-discussion] finding close together points.

> Hi all,
>
> I have a bunch of points in 2-d space, and I need to find out which
> pairs of points are within a certain distance of one another (regular
> old Euclidean norm).
>
> scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.

Chris,

Maybe I'm missing something simple, but if your array of 2D points is static, a KD tree for 2D nearest neighbor seems like overkill. You might want to try the simple approach of using boxes of points to narrow things down by sorting on the first component. If your distances are, say, 10% of the variance, then you'll *roughly* decrease the remaining points to search by a factor of 10. This can get more sophisticated and is useful in higher dimensions (see: Sameer A. Nene and Shree K. Nayar, "A Simple Algorithm for Nearest Neighbor Search in High Dimensions", IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (9), 989 (1997)), where for a static data set it can match KD trees in speed but is SO much easier to program.

-- Lou Pecora, my views are my own.
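[Editorial sketch] The slice-search idea Lou describes -- sort on the first component, then only test candidates whose x-coordinate is within the distance -- can be sketched like this. It is a simplified illustration, not the actual Nene & Nayar algorithm, and the function name is invented.

```python
import numpy as np

def close_pairs_sliced(points, d):
    """Find unique pairs within distance d, narrowing candidates by x."""
    points = np.asarray(points, dtype=float)
    order = np.argsort(points[:, 0])  # indices sorted by x-coordinate
    xs = points[order, 0]
    pairs = set()
    for a in range(len(order)):
        i = order[a]
        # only points whose x lies within d of this one can be close,
        # and they occupy a contiguous slice of the sorted order
        hi = np.searchsorted(xs, xs[a] + d, side='right')
        for b in range(a + 1, hi):
            j = order[b]
            if np.linalg.norm(points[i] - points[j]) <= d:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

With roughly uniform data, the x-slice discards most of the N^2 candidate pairs before any full distance computation, which is the factor-of-10 effect described above.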
Re: [Numpy-discussion] finding close together points.
On Tue, Nov 10, 2009 at 11:15 PM, David Cournapeau <da...@ar.media.kyoto-u.ac.jp> wrote:
> Charles R Harris wrote:
>> I think Python lists are basically just expanding arrays and pointers
>> are cheap. Where you might lose is in creating python objects to put
>> in the list and not having ufuncs and the rest of the numpy machinery.
>> If you don't need the machinery, lists are probably not a bad way to
>> go.
>
> I am not familiar enough with the kdtree code: if you need to
> manipulate core datatypes (float, etc...), I think python lists are a
> significant cost, because of pointer chasing in particular. It may or
> may not apply, but Stefan used some tricks to avoid using python lists
> and use basic C structures in some custom code (written in cython);
> maybe he has something to say.
>
> I would love having a core C library of containers -

There are a few candidates we could take code from: I've got a cythonized structure for union-find (relations), and we could put together a heap pretty easily from heap sort. I've also got one for the problem at hand, but it is for several thousand points (stars) scattered over an image and uses C++ lists.

Chuck
Re: [Numpy-discussion] finding close together points.
2009/11/10 Christopher Barker <chris.bar...@noaa.gov>:
> Hi all,
>
> I have a bunch of points in 2-d space, and I need to find out which
> pairs of points are within a certain distance of one another (regular
> old Euclidean norm).

This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't.

> scipy.spatial.KDTree.query_ball_tree() seems like it's built for this.
> However, I'm a bit confused. The first argument is a kdtree, but I'm
> calling it as a method of a kdtree -- I want to know which points in
> the tree I already have are closer than some r from each other. If I
> call it as:
>
>     tree.query_ball_tree(tree, r)
>
> I get a big list that has all the points in it (some of them paired up
> with close neighbors). It appears I'm getting the distances between
> all the points in the tree and itself, as though they were different
> trees. This is slow, takes a bunch of memory, and I then have to parse
> the list to find the ones that are paired up. Is there a way to get
> just the close ones from the single tree?

Unfortunately not at the moment. It's slow both because you're using the python implementation rather than the C, and because you're getting all pairs, where "pair" includes pairing a point with itself (and also each distinct pair in both orders). The tree really should allow self-queries that don't return the point paired with itself.

The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad.

If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown.

Anne
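[Editorial sketch] The self-query Chris describes can be emulated today by filtering the all-pairs output of query_ball_tree. This is the slow, duplicate-producing route discussed above, shown only to make the filtering step concrete; the data is a made-up example.

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([[0.0, 0.0], [0.0, 0.5], [3.0, 3.0]])
tree = KDTree(points)

# Querying the tree against itself returns, for each point, the indices of
# all points within r -- including the point itself and both orderings
# of every pair.
neighbors = tree.query_ball_tree(tree, r=1.0)

# Keep each pair once: drop self-matches and keep only i < j.
pairs = sorted((i, j) for i, js in enumerate(neighbors) for j in js if i < j)
```

The `i < j` filter is exactly the parsing step the message complains about having to do by hand.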
Re: [Numpy-discussion] finding close together points.
On Tue, Nov 10, 2009 at 7:07 PM, Christopher Barker chris.bar...@noaa.gov wrote: Hi all, I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one another (regular old Euclidean norm). scipy.spatial.KDTree.query_ball_tree() seems like it's built for this. In some cases a brute-force approach is also good. If r is a matrix of shape Nx2: (r*r).sum(axis=1) - 2 * numpy.dot(r, r.T) + (r*r).sum(axis=1).reshape((r.shape[0], 1)) < thresh**2 It's brute force, but it takes advantage of fast matrix multiplication. -- http://www-etud.iro.umontreal.ca/~bergstrj ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
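[Archive note: spelled out as runnable code, the one-liner above looks roughly like this. A sketch under the assumption that the comparison against thresh**2 is meant to produce a boolean mask; the data and names are illustrative.]

```python
import numpy as np

np.random.seed(0)
r = np.random.rand(200, 2)   # N x 2 point coordinates (illustrative data)
thresh = 0.05

# ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, computed for all pairs at once
# with a single fast matrix multiplication.
sq = (r * r).sum(axis=1)
d2 = sq - 2.0 * np.dot(r, r.T) + sq.reshape((r.shape[0], 1))
close = d2 < thresh ** 2     # boolean N x N mask

# The diagonal (each point vs. itself) is True; the strict upper triangle
# keeps each remaining pair exactly once.
i, j = np.nonzero(np.triu(close, k=1))
pairs = list(zip(i.tolist(), j.tolist()))
```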
Re: [Numpy-discussion] finding close together points.
James Bergstra wrote: In some cases a brute-force approach is also good. true. If r is a matrix of shape Nx2: (r*r).sum(axis=1) - 2 * numpy.dot(r, r.T) + (r*r).sum(axis=1).reshape((r.shape[0], 1)) < thresh**2 It's brute force, but it takes advantage of fast matrix multiplication. I'm more concerned about memory -- doesn't this use N^2 memory? Which could be an issue here. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
On Tue, Nov 10, 2009 at 8:17 PM, Christopher Barker chris.bar...@noaa.gov wrote: James Bergstra wrote: In some cases a brute-force approach is also good. true. If r is a matrix of shape Nx2: (r*r).sum(axis=1) - 2 * numpy.dot(r, r.T) + (r*r).sum(axis=1).reshape((r.shape[0], 1)) < thresh**2 It's brute force, but it takes advantage of fast matrix multiplication. I'm more concerned about memory -- doesn't this use N^2 memory? Which could be an issue here. Yes, this uses N^2 time and space. It's not a good algorithm when N is large. -- http://www-etud.iro.umontreal.ca/~bergstrj ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
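[Archive note: a standard way around the N^2 *memory* (though not the N^2 time) is to apply the same vectorized test one block of rows at a time. A hedged sketch, not code from the thread; the block size and helper name are made up.]

```python
import numpy as np

def close_pairs_blocked(r, thresh, block=256):
    """Yield index pairs (i, j), i < j, with ||r[i] - r[j]|| < thresh."""
    sq = (r * r).sum(axis=1)
    n = r.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances from rows [start:stop) to all points; the
        # temporary is only (block x N) instead of (N x N).
        d2 = sq[start:stop, None] - 2.0 * np.dot(r[start:stop], r.T) + sq[None, :]
        bi, bj = np.nonzero(d2 < thresh ** 2)
        for i, j in zip(bi + start, bj):
            if i < j:          # drop self-matches and duplicate orderings
                yield int(i), int(j)

np.random.seed(0)
r = np.random.rand(150, 2)
pairs = list(close_pairs_blocked(r, 0.05, block=64))
```

Peak memory is O(block * N), so `block` trades memory for the number of BLAS calls.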
Re: [Numpy-discussion] finding close together points.
Also, is it not returning distances between points and themselves? Or am I misinterpreting it? DG On Tue, Nov 10, 2009 at 5:17 PM, Christopher Barker chris.bar...@noaa.gov wrote: James Bergstra wrote: In some cases a brute-force approach is also good. true. If r is a matrix of shape Nx2: (r*r).sum(axis=1) - 2 * numpy.dot(r, r.T) + (r*r).sum(axis=1).reshape((r.shape[0], 1)) < thresh**2 It's brute force, but it takes advantage of fast matrix multiplication. I'm more concerned about memory -- doesn't this use N^2 memory? Which could be an issue here. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
On Tue, Nov 10, 2009 at 7:48 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/10 Christopher Barker chris.bar...@noaa.gov: Hi all, I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one another (regular old Euclidean norm). This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't. scipy.spatial.KDTree.query_ball_tree() seems like it's built for this. However, I'm a bit confused. The first argument is a kdtree, but I'm calling it as a method of a kdtree -- I want to know which points in the tree I already have are closer than some r from each other. If I call it as: tree.query_ball_tree(tree, r) I get a big list that has all the points in it (some of them paired up with close neighbors). It appears I'm getting the distances between all the points in the tree and itself, as though they were different trees. This is slow, takes a bunch of memory, and I then have to parse out the list to find the ones that are paired up. Is there a way to get just the close ones from the single tree?

I used sparse_distance_matrix for the distance of a kdtree to itself in the past. Since it's a matrix it should be possible to just get the lower or upper triangle, and it's sparse so memory is not so much of a problem. But I remember it was also slow and only worth using if the matrix is large. Josef

Unfortunately not at the moment. It's slow both because you're using the python implementation rather than the C, and because you're getting all pairs, where "pair" includes pairing a point with itself (and also each distinct pair in both orders). The tree really should allow self-queries that don't return the point and itself. The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad.
If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown. Anne thanks, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
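[Archive note: Josef's sparse_distance_matrix suggestion looks roughly like this in practice. A sketch, not code from the thread; the point set and radius are illustrative.]

```python
import numpy as np
from scipy.spatial import KDTree

np.random.seed(0)
pts = np.random.rand(100, 2)
tree = KDTree(pts)

# sparse_distance_matrix(other, max_distance) returns a scipy.sparse
# dok_matrix containing only entries with distance <= max_distance, so
# memory scales with the number of close pairs rather than N^2.
D = tree.sparse_distance_matrix(tree, max_distance=0.05)

# Each close pair appears at both (i, j) and (j, i); keeping the strict
# upper triangle gives each pair once.
pairs = sorted((i, j) for (i, j) in D.keys() if i < j)
```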
Re: [Numpy-discussion] finding close together points.
Anne Archibald wrote: 2009/11/10 Christopher Barker chris.bar...@noaa.gov: I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one another (regular old Euclidean norm). This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't. darn. It's slow both because you're using the python implementation rather than the C, I noticed that. The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad. Can you give me a hint as to where to start? I have no idea how kd trees work. If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown. And certainly in this case, there would not be very many results, and lists are pretty fast. However, this does bring up the need for a growable numpy array. I've been working on one in Python, but it would be really nice to have one that could be accessed by C (or cython) code. It wouldn't be that hard to do (for someone familiar with the numpy code, anyway), I don't think. It is mostly boilerplate in Python. I'm imagining it would only support contiguous arrays, and could only grow in the first dimension. -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
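[Archive note: the growable array Chris describes is essentially amortized doubling over a contiguous buffer. A minimal Python sketch, not Chris's actual code; the class and method names are made up.]

```python
import numpy as np

class GrowableArray:
    """Contiguous 2-D buffer that can grow in the first dimension."""

    def __init__(self, cols, dtype=float):
        self._data = np.empty((4, cols), dtype=dtype)
        self._n = 0

    def append(self, row):
        if self._n == self._data.shape[0]:
            # Double the first dimension: appends are amortized O(1).
            bigger = np.empty((2 * self._data.shape[0], self._data.shape[1]),
                              dtype=self._data.dtype)
            bigger[:self._n] = self._data
            self._data = bigger
        self._data[self._n] = row
        self._n += 1

    @property
    def values(self):
        # View of the filled portion; no copy.
        return self._data[:self._n]

g = GrowableArray(2)
for k in range(10):
    g.append((k, k + 0.5))
```

Because the storage stays contiguous, a C or Cython consumer could in principle work directly on the underlying buffer, which is the appeal over a Python list of tuples.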
Re: [Numpy-discussion] finding close together points.
On Tue, Nov 10, 2009 at 10:51 PM, Christopher Barker chris.bar...@noaa.gov wrote: Anne Archibald wrote: 2009/11/10 Christopher Barker chris.bar...@noaa.gov: I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one another (regular old Euclidean norm). This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't. darn. It's slow both because you're using the python implementation rather than the C, I noticed that. The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad. Can you give me a hint as to where to start? I have no idea how kd trees work. If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown. And certainly in this case, there would not be very many results, and lists are pretty fast. However, this does bring up the need for a growable numpy array. I've been working on one in Python, but it would be really nice to have one that could be accessed by C (or cython) code. It wouldn't be that hard to do (for someone familiar with the numpy code, anyway), I don't think. It is mostly boilerplate in Python. I'm imagining it would only support contiguous arrays, and could only grow in the first dimension.

I think Python lists are basically just expanding arrays and pointers are cheap. Where you might lose is in creating python objects to put in the list and not having ufuncs and the rest of the numpy machinery. If you don't need the machinery, lists are probably not a bad way to go.
Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
Charles R Harris wrote: I think Python lists are basically just expanding arrays and pointers are cheap. Where you might lose is in creating python objects to put in the list and not having ufuncs and the rest of the numpy machinery. If you don't need the machinery, lists are probably not a bad way to go. I am not familiar enough with the kdtree code: if you need to manipulate core datatypes (float, etc...), I think python lists are a significant cost, because of pointer chasing in particular. It may or may not apply, but Stefan used some tricks to avoid using python lists and use basic C structures in some custom code (written in cython), maybe he has something to say. I would love having a core C library of containers; there are a few candidates we could take code from:
- http://github.com/stevedekorte/basekit: used in the Io programming language. Quite a few goodies, from lists to trees to sets to portable code for dynamically loaded libraries. BSD
- The Apple Core Foundation (I have to check that the Apple license is OK). It is more complex, but is designed with Objective-C in mind, meaning integration with the C python API may be easier (both Obj-C and the Python C API being based on ref counting).
- Another one mentioned by Robert Kern, I forgot the reference.
I hope to make it my main focus for the 1.5 release, cheers, David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
On Wed, Nov 11, 2009 at 01:15, David Cournapeau da...@ar.media.kyoto-u.ac.jp wrote: - The Apple Core Foundation (I have to check the Apple license is ok). No. The APSL is not DFSG-free. It is more complex, but is designed with objective-C in mind, meaning integration with the C python API may be easier (both Obj-C and Python C API being based on ref counting). - Another one mentioned by Robert Kern, I forgot the reference. http://c-algorithms.sourceforge.net/ -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
Robert Kern wrote: No. The APSL is not DFSG-free. It was too good to be true, I guess. http://c-algorithms.sourceforge.net/ Thanks for the link, David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
On Tue, 10 Nov 2009 16:07:32 -0800, Christopher Barker chris.bar...@noaa.gov wrote: Hi all, I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one another (regular old Euclidean norm). How big is your set of points? scipy.spatial.KDTree.query_ball_tree() seems like it's built for this. However, I'm a bit confused. The first argument is a kdtree, but I'm calling it as a method of a kdtree -- I want to know which points in the tree I already have are closer than some r from each other. If I call it as: tree.query_ball_tree(tree, r) I get a big list that has all the points in it (some of them paired up with close neighbors). It appears I'm getting the distances between all the points in the tree and itself, as though they were different trees. This is slow, takes a bunch of memory, and I then have to parse out the list to find the ones that are paired up. Is there a way to get just the close ones from the single tree? thanks, -Chris -- Peter Schmidtke -- PhD Student at the Molecular Modeling and Bioinformatics Group Dep. Physical Chemistry Faculty of Pharmacy University of Barcelona ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion