[Numpy-discussion] copying array to itself
I need to do some shifting of data within an array and am using the following code:

for p in numpy.arange(array.shape[0], dtype='int64'):
    for q in numpy.arange(array.shape[1]):
        # A positive shift is towards zero
        shift = shift_values[p, q]
        if shift >= 0:
            copy_len = array.shape[2] - shift
            array[p, q, :copy_len] = array[p, q, shift:]
            array[p, q, copy_len:] = 0
        else:
            neg_shift = -shift
            copy_len = array.shape[2] - neg_shift
            array[p, q, neg_shift:] = array[p, q, :copy_len]
            array[p, q, :neg_shift] = 0

In essence, if shift is positive, it copies the data to a block closer to the zero index on the last dimension, and if it's negative, it copies it to a block further away. The problem I've encountered is that for some values of p and q (which are consistent across runs), it simply writes deterministic and consistent garbage (a repeated pattern around -0.002) to the part of the last dimension to which I'm trying to copy. Most of the values of p and q work fine, as expected. I can solve the problem with an interim copy. Is this some nuance of the way numpy does things? Or am I missing some stupid bug in my code?

cheers,
Henry

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
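For reference, the interim-copy workaround mentioned above can be sketched as follows. This is a minimal sketch with made-up `array` and `shift_values` data; the key point is that each row is read from a separate copy, so the assignment can never read memory it has already overwritten:

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.standard_normal((4, 5, 16))          # made-up test data
shift_values = rng.integers(-3, 4, size=(4, 5))  # made-up shifts

for p in range(array.shape[0]):
    for q in range(array.shape[1]):
        shift = int(shift_values[p, q])
        row = array[p, q].copy()  # interim copy: source is now separate memory
        array[p, q] = 0
        if shift >= 0:
            # positive shift: move data towards the zero index
            array[p, q, :array.shape[2] - shift] = row[shift:]
        else:
            # negative shift: move data away from the zero index
            array[p, q, -shift:] = row[:array.shape[2] + shift]
```

The extra copy per row costs memory bandwidth, but it removes any dependence on how numpy buffers overlapping slice assignments.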
[Numpy-discussion] sparse array data
Hi all, I'm currently writing code that needs three-dimensional data (for the physicists: its dimensions are atom, ion, level). The problem is that not all combinations exist (a sparse array). Sparse matrices in scipy only deal with two dimensions. The operations that I need to do on those are running functions like exp(item/constant) on all of the items. I also want to sum them up in the last dimension. What's the best way to make a class that takes this kind of data and does the required operations fast? Maybe some physicists have implemented these things already. Any thoughts?

Cheers,
Wolfgang
Re: [Numpy-discussion] sparse array data
What about numpy.ma? Those are masked arrays. But they won't be the fastest.

Fred

On Wed, May 2, 2012 at 12:16 PM, Wolfgang Kerzendorf wkerzend...@gmail.com wrote:
> What's the best way to make a class that takes this kind of data and does the required operations fast? Maybe some physicists have implemented these things already. Any thoughts? [...]
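A minimal sketch of the numpy.ma suggestion, assuming the missing (atom, ion, level) combinations can be marked with NaN and masked; the data values here are made up:

```python
import numpy as np

# Hypothetical (atom, ion, level) data; np.nan marks combinations
# that do not exist.
data = np.array([[[1.0, 2.0], [np.nan, 3.0]],
                 [[4.0, np.nan], [5.0, 6.0]]])

m = np.ma.masked_invalid(data)     # mask the missing entries
constant = 2.0

result = np.ma.exp(m / constant)   # elementwise function, mask preserved
level_sum = result.sum(axis=-1)    # sum over the last (level) dimension
```

Masked entries are simply skipped in the sum, which matches the "not all combinations exist" semantics, at the cost of storing the full dense array.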
[Numpy-discussion] record arrays and vectorizing
Hello, Can somebody give me some hints as to how to code up this function in pure python, rather than dropping down to Fortran? I want to compare a 7-element vector (called element) to a large list of similarly-dimensioned vectors (called target), and pick out the vector in target that is the closest to element (determined by minimizing the Euclidean distance). For instance, in (slow) brute force form it would look like:

element = numpy.array([1, 2, 3, 4, 5, 6, 7])
target = numpy.array(range(0, 49)).reshape(7,7)*0.1

min_length = numpy.inf
min_index = -1
for i in xrange(0, 7):
    distance = (element - target[i, :])**2
    distance = numpy.sqrt(distance.sum())
    if distance < min_length:
        min_length = distance
        min_index = i

Now of course, the actual problem will be of a much larger scale. I will have an array of elements, and a large number of potential targets. I was thinking of having element be an array where each element itself is a numpy.ndarray, and then vectorizing the code above so that as an output I would have an array of the min_index and min_length values. I can get the following simple test to work, so I may be on the right track:

import numpy

dtype = [("x", numpy.ndarray)]

def single(data):
    return data[0].min()

multiple = numpy.vectorize(single)

if __name__ == "__main__":
    a = numpy.arange(0, 16).reshape(4,4)
    b = numpy.recarray((4), dtype=dtype)
    for i in xrange(0, b.shape[0]):
        b[i]["x"] = a[i,:]
    print a
    print b
    x = multiple(b)
    print x

What is the best way of constructing b from a? I tried

b = numpy.recarray((4), dtype=dtype, buf=a)

but I get a segmentation fault when I try to print b. Is there a way to perform this larger task efficiently with record arrays and vectorization, or am I off on the wrong track completely? How can I do this efficiently without dropping down to Fortran?

Thanks for any advice,
Catherine
Re: [Numpy-discussion] record arrays and vectorizing
On Wed, May 2, 2012 at 11:06 AM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote:
> I will want to compare a 7-element vector (called element) to a large list of similarly-dimensioned vectors (called target), and pick out the vector in target that is the closest to element (determined by minimizing the Euclidean distance).

It's not entirely clear what you mean from the description above. In the code example, you return a single index, but from the description it sounds like you want to pick out a vector? If you need multiple answers, one for each element, then you probably need to do broadcasting as shown in the NumPy medkit: http://mentat.za.net/numpy/numpy_advanced_slides/

Stéfan
Re: [Numpy-discussion] copying array to itself
On Wed, May 2, 2012 at 9:03 AM, Henry Gomersall h...@cantab.net wrote:
> Is this some nuance of the way numpy does things? Or am I missing some stupid bug in my code?

Try playing with the parameters of the following code:

sz = 1
N = 10

import numpy as np
x = np.arange(sz)
y = x.copy()
x[:-N] = x[N:]
np.testing.assert_equal(x[:-N], y[N:])

For small values of sz this typically works, but as soon as numpy needs to buffer, strange things happen because you are reading from memory locations that you've already written to.

Stéfan
Re: [Numpy-discussion] sparse array data
On 5/2/12 11:16 AM, Wolfgang Kerzendorf wrote:
> What's the best way to make a class that takes this kind of data and does the required operations fast? Maybe some physicists have implemented these things already. Any thoughts? [...]

Curiously enough, I have recently been discussing with Travis O. about how to represent sparse matrices with complete generality. One of the possibilities is to use what Travis calls synthetic dimensions. The idea behind it is easy: use a table with as many columns as dimensions, and add another one for the actual values of the array. For a 3-D sparse array, this looks like:

dim0 | dim1 | dim2 | value
==========================
   0 |    0 |    0 | val0
   0 |   10 |  100 | val1
  20 |    5 |  202 | val2

You can use any package that deals with tables for implementing such a thing. I'm going to quickly describe a raw implementation of this on top of carray [1], not only because I'm the author, but also because it adapts well to the needs you exposed.

[1] https://github.com/FrancescAlted/carray

Let's start with a small array with shape (5,2). We are going to use a dense array for this, mainly for comparison purposes with typical NumPy arrays, but of course the logic behind this can be extended to multidimensional sparse arrays with complete generality.
In [1]: import carray as ca

In [2]: import numpy as np

In [3]: syn_dtype = np.dtype([('dim0', np.uint32), ('dim1', np.uint32), ('value', np.float64)])

In [4]: N = 10

In [6]: ct = ca.fromiter(((i/2, i%2, i*i) for i in xrange(N)), dtype=syn_dtype, count=N)

In [7]: ct
Out[7]:
ctable((10,), |V16)
  nbytes: 160; cbytes: 12.00 KB; ratio: 0.01
  cparams := cparams(clevel=5, shuffle=True)
[(0, 0, 0.0) (0, 1, 1.0) (1, 0, 4.0) (1, 1, 9.0) (2, 0, 16.0)
 (2, 1, 25.0) (3, 0, 36.0) (3, 1, 49.0) (4, 0, 64.0) (4, 1, 81.0)]

Okay, we have our small array. Now, let's apply a function to the values (in this case the log()):

In [8]: ct['value'][:] = ct.eval('log(value)')

In [9]: ct
Out[9]:
ctable((10,), |V16)
  nbytes: 160; cbytes: 12.00 KB; ratio: 0.01
  cparams := cparams(clevel=5, shuffle=True)
[(0, 0, -inf) (0, 1, 0.0) (1, 0, 1.3862943611198906)
 (1, 1, 2.1972245773362196) (2, 0, 2.772588722239781)
 (2, 1, 3.2188758248682006) (3, 0, 3.58351893845611)
 (3, 1, 3.8918202981106265) (4, 0, 4.1588830833596715)
 (4, 1, 4.394449154672439)]

carray uses numexpr behind the scenes, so these operations are very fast. Also, for functions not supported inside numexpr, carray can make use of the ones in NumPy (although these are typically not as efficient). Let's see how to do sums along the different axes. For this, we will use the selection capabilities of the ctable object. Let's do the sum over the last axis first:

In [10]: [ sum(row.value for row in ct.where('(dim0==%d)' % (i,))) for i in range(N/2) ]
Out[10]: [-inf, 3.58351893845611, 5.991464547107982, 7.475339236566736, 8.55333223803211]

So, it is just a matter of summing over dim1 while keeping dim0 fixed.
One can check that the results are the same as for NumPy:

In [11]: t = np.fromiter((np.log(i*i) for i in xrange(N)), dtype='f8').reshape(N/2,2)

In [12]: t.sum(axis=1)
Out[12]: array([ -inf, 3.58351894, 5.99146455, 7.47533924, 8.55333224])

Summing over the leading dimension means keeping dim1 fixed:

In [13]: [ sum(row.value for row in ct.where('(dim1==%d)' % (i,))) for i in range(2) ]
Out[13]: [-inf, 13.702369854987484]

and again, this is the same as using the `axis=0` parameter:

In [14]: t.sum(axis=0)
Out[14]: array([-inf, 13.70236985])

Summing everything is, as expected, the easiest:

In [15]: sum(row.value for row in ct.iter())
Out[15]: -inf

In [16]: t.sum()
Out[16]: -inf

Of course, the case for more dimensions requires a bit more complexity, but nothing fancy (this is left as an exercise for the reader ;). In case you are going to use this in your package, you may want to create wrappers that access the different functionality more easily. Finally, you should note that I used 4-byte integers for representing the dimensions. If this is not enough, you can use 8-byte integers too. As the carray objects are compressed by default, this usually doesn't take a lot of space. For example, for an array with 1 million elements:

In [31]: ct = ca.fromiter(((i/2, i%2, i*i) for i in xrange(N)), dtype=syn_dtype, count=N)

In [32]: ct
Out[32]:
ctable((1000000,), |V16)
  nbytes: 15.26 MB; cbytes: 1.76 MB; ratio: 8.67
  cparams := cparams(clevel=5, shuffle=True)
[(0, 0, 0.0), (0, 1, 1.0), (1,
Re: [Numpy-discussion] Continuous Integration
> I've been working on setting up a new buildbot for NumPy. Unfortunately, I don't have much time to work on it, so it's slow going! Right now I am still at the stage of getting NumPy to pass all its tests on the machines I'm using as test slaves. After that, I plan to transfer existing slaves to the new setup, and then maybe ask for new volunteer slave machines (if people think the buildbot setup is useful).

Hi, If there are things one can contribute to help the development of the buildbot for NumPy, I would be happy to participate!

David
Re: [Numpy-discussion] sparse array data
Hi Francesc

On Wed, May 2, 2012 at 1:53 PM, Francesc Alted franc...@continuum.io wrote:
> and add another one for the actual values of the array. For a 3-D sparse array, this looks like:
>
> dim0 | dim1 | dim2 | value
> ==========================
>    0 |    0 |    0 | val0
>    0 |   10 |  100 | val1
>   20 |    5 |  202 | val2

What's the distinction between this and a coo_matrix?

Stéfan
Re: [Numpy-discussion] sparse array data
On Wed, May 2, 2012 at 9:53 PM, Francesc Alted franc...@continuum.io wrote:
> Curiously enough, I have recently been discussing with Travis O. about how to represent sparse matrices with complete generality. One of the possibilities is to use what Travis calls synthetic dimensions. The idea behind it is easy: use a table with as many columns as dimensions, and add another one for the actual values of the array. For a 3-D sparse array, this looks like:
>
> dim0 | dim1 | dim2 | value
> ==========================
>    0 |    0 |    0 | val0
>    0 |   10 |  100 | val1
>   20 |    5 |  202 | val2

This coordinate format is also what's used by the MATLAB Tensor Toolbox. They have a paper justifying this choice and describing some tricks for how to work with them: http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1 (Spoiler: you use a lot of sort operations. Conveniently, timsort appears to be perfectly adapted for their algorithmic requirements.)

I'm not sure why one would make up a new term like synthetic dimensions, though; it's just standard coordinate format...

Though, for the original poster, depending on their exact problem, they might be better off just using a list or object ndarray of scipy.sparse matrices. Or some coordinate arrays like above, plus add.reduceat for the sums they mentioned.
-- Nathaniel
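A small sketch of that last suggestion (coordinate arrays plus np.add.reduceat), using made-up data: sorting the rows by (dim0, dim1) groups the entries so that a single reduceat call sums over the last dimension.

```python
import numpy as np

# Hypothetical coordinate-format 3-D sparse data: one row per existing entry.
dim0 = np.array([0, 0, 0, 1, 1])
dim1 = np.array([0, 0, 2, 1, 1])
dim2 = np.array([3, 7, 1, 0, 5])   # the "level" axis we sum over
val  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Elementwise functions like exp(item/constant) apply directly to the
# value column, with no densification.
tval = np.exp(val / 2.0)

# Sum over the last dimension: group rows by (dim0, dim1).
order = np.lexsort((dim1, dim0))   # primary key dim0, secondary dim1
d0, d1, v = dim0[order], dim1[order], tval[order]

# Start index of each distinct (dim0, dim1) group in the sorted rows.
starts = np.flatnonzero(np.r_[True, (d0[1:] != d0[:-1]) | (d1[1:] != d1[:-1])])
sums = np.add.reduceat(v, starts)  # one sum per distinct (dim0, dim1) pair
```

The pairs (d0[starts], d1[starts]) label each entry of `sums`, so the result is itself a 2-D sparse array in the same coordinate format.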
Re: [Numpy-discussion] Issue Tracking
On Wed, May 2, 2012 at 9:48 AM, Charles R Harris charlesr.har...@gmail.com wrote:
> On Tue, May 1, 2012 at 11:47 PM, Ralf Gommers ralf.gomm...@googlemail.com wrote:
>> On Wed, May 2, 2012 at 1:48 AM, Pauli Virtanen p...@iki.fi wrote:
>>> 01.05.2012 21:34, Ralf Gommers wrote: [clip] At this point it's probably good to look again at the problems we want to solve: 1. responsive user interface (must absolutely have)
>>> Now that it comes too late: with some luck, I've possibly hit on what was ailing the Tracs (max_diff_bytes configured too large). Let's see if things work better from now on...
>> That's amazing - not only does it not give errors anymore, it's also an order of magnitude faster. So maybe we could just stick with trac. Performance was really the sticking point.
> Chuck

FWIW I'm pretty strongly in favor of GHI for NumPy/SciPy (I am going to get involved in NumPy dev eventually, promise). While warty in some of the places already mentioned, I have found it to be very low-friction and low-annoyance in my own dev process (nearing 1000 issues closed in the last year in pandas). But there are fewer cooks in the kitchen with pandas, so perhaps this experience wouldn't be identical with NumPy. The biggest benefit I've seen is community involvement that you really wouldn't see if I were using a Trac or something else hosted elsewhere. Users are on GitHub and it for some reason gives people a feeling of engagement in the open source process that I don't see anywhere else.

- Wes
[Numpy-discussion] record arrays initialization
Thanks to Perry for some very useful off-list conversation. I realize that I wasn't being clear at all in my earlier description of the problem, so here it is in a nutshell:

Find the best match in an array t(5000, 7) for a single vector e(7). Now scale it up so e is (128, 512, 7) and I want to return a (128, 512) array of the t-identifiers that are the best match for e. Best match is defined as the minimum Euclidean distance.

I'm going to try three ways: (a) brute force and lots of looping in python, (b) constructing a function to find the match for a single instance of e and vectorizing it, and (c) coding it in Fortran. I'll be curious to see the performance figures.

Two smaller questions:

A) How do I most efficiently construct a record array from a single array? I want to do the following, but it segfaults on me when I try to print b:

vtype = [("x", numpy.ndarray)]
a = numpy.arange(0, 16).reshape(4,4)
b = numpy.recarray((4), dtype=vtype, buf=a)
print a
print b

What is the most efficient way of constructing b from the values of a? In real life, a is (128*512*7) and I want b to be (128, 512) with the x component being a 7-value numpy array.

and B) If I'm vectorizing a function (single) to find the best match for a single element of e within t, how do I pass the entire array t into the function without having it parcelled down to its individual elements? i.e.

def single(element, targets):
    nlen = element.shape[0]
    nvec = targets.data.shape[0]
    x = element.reshape(1, nlen).repeat(nvec, axis=0)
    diffs = ((x - targets.data)**2).sum(axis=1)
    diffs = numpy.sqrt(diffs)
    return numpy.argmin(diffs, axis=0)

multiple = numpy.vectorize(single)
x = multiple(all_elements, target)

where all_elements is similar to b in my first example, and target is a 2-d array. The above code doesn't work because target gets reduced to a single element when it gets down to single, and I need to see the whole array when I'm down in single.
I found a work-around by encapsulating target into a single object and passing in the object, but I'm curious whether there's a better way of doing this.

I hope I've explained myself better this time around,
Catherine
Re: [Numpy-discussion] record arrays and vectorizing
On Wed, May 2, 2012 at 1:06 PM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote:
> Can somebody give me some hints as to how to code up this function in pure python, rather than dropping down to Fortran? I will want to compare a 7-element vector (called element) to a large list of similarly-dimensioned vectors (called target), and pick out the vector in target that is the closest to element (determined by minimizing the Euclidean distance). [...]

If you are just trying to find the index of the vector in target that is closest to element, then I think the default broadcasting would work fine. Here is an example that should work (the broadcasting is done for the subtraction element-targets):

In [39]: element = np.arange(1,8)

In [40]: targets = np.random.uniform(0,8,(1000,7))

In [41]: distance_squared = ((element-targets)**2).sum(1)

In [42]: min_index = distance_squared.argmin()

In [43]: element
Out[43]: array([1, 2, 3, 4, 5, 6, 7])

In [44]: targets[min_index,:]
Out[44]: array([ 1.93625981, 2.56137284, 2.23395169, 4.15215253, 3.96478248, 5.21829915, 5.13049489])

Note - depending on the number of vectors in targets, it might be better to have everything transposed if you are really worried about the timing; you'd need to try that for your particular case.

Hope that helps,
Aronne
Re: [Numpy-discussion] Continuous Integration
David Froger david.froger at gmail.com writes:
> > I've been working on setting up a new buildbot for NumPy. Unfortunately, I don't have much time to work on it, so it's slow going! ...
> Hi, If there are things one can contribute to help the development of the buildbot for NumPy, I would be happy to participate!

Great! I've sent you a message off the list so we can coordinate further.

Chris
Re: [Numpy-discussion] sparse array data
On 5/2/12 4:07 PM, Stéfan van der Walt wrote:
> What's the distinction between this and a coo_matrix?

Well, as the OP said, coo_matrix does not support dimensions larger than 2, right?

In [4]: coo_matrix((3,4), dtype=np.int8).todense()
Out[4]:
matrix([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=int8)

In [5]: coo_matrix((2,3,2), dtype=np.int8).todense()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/faltet/<ipython-input-5-46b8ae62259f> in <module>()
----> 1 coo_matrix((2,3,2), dtype=np.int8).todense()

/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    127                 obj, ij = arg1
    128             except:
--> 129                 raise TypeError('invalid input format')
    130
    131         try:

TypeError: invalid input format

-- Francesc Alted
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 2:45 PM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote:
> Find the best match in an array t(5000, 7) for a single vector e(7). Now scale it up so e is (128, 512, 7) and I want to return a (128, 512) array of the t-identifiers that are the best match for e. Best match is defined as the minimum Euclidean distance. I'm going to try three ways: (a) brute force and lots of looping in python, (b) constructing a function to find the match for a single instance of e and vectorizing it, and (c) coding it in Fortran. I'll be curious to see the performance figures.

I'd use a mixture of (a) and (b): break the t(N, 7) up into blocks of, say, (1000, 7), compute the best match in each using broadcasting, and then combine your results to find the best of the best. This strategy should be best for very large N. For moderate N, where broadcasting easily fits into memory, the answer given in reply to your original email would do the trick.

> A) How do I most efficiently construct a record array from a single array? I want to do the following, but it segfaults on me when I try to print b.
>
> vtype = [("x", numpy.ndarray)]
> a = numpy.arange(0, 16).reshape(4,4)
> b = numpy.recarray((4), dtype=vtype, buf=a)

I prefer not to use record arrays, and stick to structured arrays:

In [11]: vtype = np.dtype([('x', (np.float, 4))])

In [12]: a = np.arange(16.).reshape((4,4))

In [13]: a.view(vtype)
Out[13]:
array([[([0.0, 1.0, 2.0, 3.0],)],
       [([4.0, 5.0, 6.0, 7.0],)],
       [([8.0, 9.0, 10.0, 11.0],)],
       [([12.0, 13.0, 14.0, 15.0],)]],
      dtype=[('x', 'f8', (4,))])

> B) If I'm vectorizing a function (single) to find the best match for a single element of e within t, how do I pass the entire array t into the function without having it parcelled down to its individual elements?

I think the new dtype just makes your life more difficult here.
Simply do:

In [49]: np.sum(a - elements.T, axis=1)
Out[49]: array([ 0., 16., 32., 48.])

Stéfan
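The block-wise strategy described above can be sketched as a small helper (a hypothetical `best_matches` function, not anything from the poster's code). It blocks over the elements and uses the expansion ||e - t||^2 = ||e||^2 - 2 e.t + ||t||^2, dropping the per-element constant ||e||^2 since it doesn't affect the argmin:

```python
import numpy as np

def best_matches(elements, targets, block=1024):
    """For each trailing-axis vector of `elements` (..., k), return the
    index of the nearest row of `targets` (N, k) by Euclidean distance."""
    e = elements.reshape(-1, targets.shape[1])
    out = np.empty(e.shape[0], dtype=np.intp)
    t2 = (targets ** 2).sum(1)                 # precompute ||t||^2 once
    for start in range(0, e.shape[0], block):
        eb = e[start:start + block]
        # ||e||^2 is constant per row, so argmin over targets only needs
        # -2 e.t + ||t||^2; the matmul keeps memory use to (block, N).
        d2 = t2[None, :] - 2.0 * (eb @ targets.T)
        out[start:start + block] = d2.argmin(1)
    return out.reshape(elements.shape[:-1])
```

For e of shape (128, 512, 7) and t of shape (5000, 7), `best_matches(e, t)` returns the (128, 512) array of t-identifiers; `block` trades memory for fewer passes.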
Re: [Numpy-discussion] sparse array data
On 5/2/12 4:20 PM, Nathaniel Smith wrote:
> This coordinate format is also what's used by the MATLAB Tensor Toolbox. They have a paper justifying this choice and describing some tricks for how to work with them: http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1 (Spoiler: you use a lot of sort operations. Conveniently, timsort appears to be perfectly adapted for their algorithmic requirements.)

Uh, I do not have access to the article, but yes, sorting makes a lot of sense for these scenarios (this is why I was suggesting using indexes, which is a sort of sorting too).

> I'm not sure why one would make up a new term like synthetic dimensions though, it's just standard coordinate format...

Yeah, this seems a bit weird. Perhaps Travis was referring to other things and I mixed concepts. Anyways.
-- Francesc Alted
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote:
> Find the best match in an array t(5000, 7) for a single vector e(7). Now scale it up so e is (128, 512, 7) and I want to return a (128, 512) array of the t-identifiers that are the best match for e. Best match is defined as the minimum Euclidean distance.

It sounds like you want to find the nearest neighbor to a point in a high-dimensional space. This sounds like a job for a spatial data structure like a KD-tree. See:

http://docs.scipy.org/doc/scipy/reference/spatial.html
http://mloss.org/software/view/143/
http://www.mrzv.org/software/pyANN/

etc.

-Kevin
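For the shapes the poster describes, a scipy.spatial query is only a few lines (a sketch using cKDTree, assuming scipy is available; the data here are random stand-ins):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
targets = rng.standard_normal((5000, 7))      # stand-in for t(5000, 7)
elements = rng.standard_normal((32, 64, 7))   # smaller stand-in for e(128, 512, 7)

tree = cKDTree(targets)                       # build once, query many times
dist, idx = tree.query(elements.reshape(-1, 7))  # nearest target per element
idx = idx.reshape(elements.shape[:-1])        # (32, 64) array of t-identifiers
```

Whether this beats brute-force broadcasting depends on the sizes involved; the timing comparison later in this thread suggests brute force wins at 7 dimensions and 5000 targets.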
Re: [Numpy-discussion] sparse array data
On Wed, May 2, 2012 at 3:20 PM, Francesc Alted franc...@continuum.io wrote:
> On 5/2/12 4:07 PM, Stéfan van der Walt wrote:
> Well, as the OP said, coo_matrix does not support dimensions larger than 2, right?

That's just an implementation detail, I would imagine--I'm trying to figure out if there is a new principle behind synthetic dimensions? By the way, David Cournapeau mentioned using b-trees for sparse ops a while ago; did you ever talk to him about those ideas?

BTW, this coo-type storage is used in Stanford's probabilistic graphical models course, but for dense data (like we have in the course) it's a pain. Writing code in both Octave and Python, I again came to realize what a very elegant N-dimensional structure the numpy array exposes!

Stéfan
Re: [Numpy-discussion] sparse array data
On 5/2/12 5:28 PM, Stéfan van der Walt wrote:
> That's just an implementation detail, I would imagine--I'm trying to figure out if there is a new principle behind synthetic dimensions?

No, no new things under the sun. We were just talking about many different things and this suddenly appeared in our talk. Nothing really serious.

> By the way, David Cournapeau mentioned using b-trees for sparse ops a while ago; did you ever talk to him about those ideas?

Yup, the b-tree idea fits very well for indexing the coordinates. Although one problem with b-trees is that they do not compress well in general.

-- Francesc Alted
Re: [Numpy-discussion] Test failures - which dependencies am I missing?
Chris Ball ceball at gmail.com writes:
> Keith Hughitt keith.hughitt at gmail.com writes:
>> Hi Chris, Try sudo apt-get build-dep python-numpy to install the dependencies for building NumPy. I believe it will install all of the optional dependencies as well.
> Thanks for that, but I'd already tried it and found the same failures.

However, I also found that on my version of Ubuntu (10.04 LTS), which includes NumPy 1.3.0, running numpy.test(verbose=3) yielded the following (above, numpy is the Ubuntu-supplied numpy, in case that wasn't clear):

nose.config: INFO: Excluding tests matching ['f2py_ext', 'f2py_f90_ext', 'gen_ext', 'pyrex_ext', 'swig_ext', 'array_from_pyobj']

I've discovered that numpy itself explicitly excludes these tests (in numpy/testing/nosetester.py):

# Stuff to exclude from tests. These are from numpy.distutils
excludes = ['f2py_ext', 'f2py_f90_ext', 'gen_ext', 'pyrex_ext', 'swig_ext']

So, all is explained now.

Chris
Re: [Numpy-discussion] sparse array data
On Wed, May 2, 2012 at 11:26 PM, Francesc Alted franc...@continuum.io wrote:
> Uh, I do not have access to the article, but yes, sorting makes a lot of sense for these scenarios (this is why I was suggesting using indexes, which is a sort of sorting too).

Doh, sorry.
Sticking the title into Google Scholar gives me this: http://csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/SIAM-67648.pdf

- N
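[A minimal sketch of how the coordinate format above supports the operations Wolfgang asked for. The coordinates, values, and the constant are made up for illustration; the grouping uses np.unique(axis=0), which needs a reasonably recent NumPy.]

```python
import numpy as np

# Hypothetical 3-D sparse array in coordinate form: one row per nonzero
# entry, columns (dim0, dim1, dim2), plus a parallel array of values.
coords = np.array([[0, 0, 0],
                   [0, 0, 5],
                   [0, 10, 100],
                   [20, 5, 202]])
values = np.array([1.0, 2.0, 3.0, 4.0])
constant = 1.0  # placeholder for the physical constant in exp(item/constant)

# Elementwise operations like exp(item/constant) touch only the stored values.
transformed = np.exp(values / constant)

# Summing over the last dimension = grouping by the leading (dim0, dim1) pair:
# np.unique labels each distinct pair, np.bincount accumulates per label.
keys, inverse = np.unique(coords[:, :2], axis=0, return_inverse=True)
sums = np.bincount(inverse.ravel(), weights=transformed)
# keys[i] is a (dim0, dim1) pair; sums[i] is the sum over dim2 at that pair.
```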
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 5:27 PM, Kevin Jacobs jac...@bioinformed.com bioinfor...@gmail.com wrote:

On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote: Thanks to Perry for some very useful off-list conversation. I realize that I wasn't being clear at all in my earlier description of the problem, so here it is in a nutshell: Find the best match in an array t(5000, 7) for a single vector e(7). Now scale it up so e is (128, 512, 7), and I want to return a (128, 512) array of the t-identifiers that are the best match for e. Best match is defined as the minimum Euclidean distance.

It sounds like you want to find the nearest neighbor to a point in a high-dimensional space. This sounds like a job for a spatial data structure like a KD-tree. See: http://docs.scipy.org/doc/scipy/reference/spatial.html http://mloss.org/software/view/143/ http://www.mrzv.org/software/pyANN/ etc.

In general this is a good suggestion - I was going to mention it earlier - but I think for this particular problem it is not better than the brute force and argmin() NumPy approach. On my laptop, the KDTree query is about a factor of 7 slower (ignoring the time cost to create the KDTree):

In [45]: def foo1(element, targets):
   ....:     distance_squared = ((element - targets)**2).sum(1)
   ....:     min_index = distance_squared.argmin()
   ....:     return sqrt(distance_squared[min_index]), min_index

In [46]: def foo2(element, T):
   ....:     return T.query(element)

In [47]: element = np.arange(1, 8)

In [48]: targets = np.random.uniform(0, 8, (5000, 7))

In [49]: T = scipy.spatial.KDTree(targets)

In [50]: %timeit foo1(element, targets)
1000 loops, best of 3: 401 us per loop

In [51]: %timeit foo2(element, T)
100 loops, best of 3: 2.92 ms per loop

Just to make sure they say the same thing:

In [53]: foo1(element, targets)
Out[53]: (1.8173671152898632, 570)

In [54]: foo2(element, T)
Out[54]: (1.8173671152898632, 570)

I think KDTree is more optimal for larger searches (more than 5000 elements) and fewer dimensions.
For example, with 500,000 elements and 2 dimensions, I get 34 ms for NumPy and 2 ms for the KDTree.

Back to the original question: at 400 us per search, even over 128x512 elements that would be 26 seconds total, I think? That might not be too bad.

Cheers,
Aronne
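[The scaled-up brute-force search the thread discusses can be vectorized over all query vectors at once with broadcasting. A sketch with smaller, made-up sizes (the helper name nearest_ids is hypothetical); at the full 128x512x5000 scale the (M, N, K) temporary is the limiting factor and you would chunk over M.]

```python
import numpy as np

def nearest_ids(e, t):
    """For each vector e[i, j, :], return the row index of t with the
    minimum Euclidean distance. e: (M, N, D), t: (K, D) -> (M, N) ints."""
    # Broadcast to (M, N, K, D) and reduce over D; squared distance is
    # enough for argmin, so the sqrt is skipped entirely.
    d2 = ((e[:, :, np.newaxis, :] - t) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)

rng = np.random.default_rng(0)
t = rng.uniform(0, 8, (500, 7))     # stand-in for the (5000, 7) table
e = rng.uniform(0, 8, (16, 32, 7))  # stand-in for the (128, 512, 7) queries
ids = nearest_ids(e, t)             # (16, 32) array of best-match indices
```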
[Numpy-discussion] record arrays initialization
On May 2, 2012, at 3:23 PM, numpy-discussion-requ...@scipy.org wrote:

A) How do I most efficiently construct a record array from a single array? I want to do the following, but it segfaults on me when I try to print b:

vtype = [('x', numpy.ndarray)]
a = numpy.arange(0, 16).reshape(4, 4)
b = numpy.recarray((4), dtype=vtype, buf=a)

I prefer not to use record arrays, and stick to structured arrays:

In [11]: vtype = np.dtype([('x', (np.float, 4))])

In [12]: a = np.arange(16.).reshape((4,4))

In [13]: a.view(vtype)
Out[13]:
array([[([0.0, 1.0, 2.0, 3.0],)],
       [([4.0, 5.0, 6.0, 7.0],)],
       [([8.0, 9.0, 10.0, 11.0],)],
       [([12.0, 13.0, 14.0, 15.0],)]],
      dtype=[('x', 'f8', (4,))])

Using structured arrays is making my code complex when I try to call the vectorized function. If I stick to the original record arrays, what's the best way of initializing b from a without doing a row-by-row copy?

Catherine
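[One way to get from the plain array to a recarray without any per-row copy is to chain two views: first reinterpret the rows under the structured dtype, then view the result as a recarray. A sketch; the field name 'x' follows the original post.]

```python
import numpy as np

# Reinterpret the rows of a plain (4, 4) float array as a structured array
# with a single field 'x' of 4 floats. Only the dtype changes; no data moves.
a = np.arange(16.).reshape(4, 4)
vtype = np.dtype([('x', np.float64, (4,))])
b = a.view(vtype).reshape(4)   # structured array of shape (4,)

# A recarray is then just another view, giving attribute access rec.x
# in place of b['x'].
rec = b.view(np.recarray)
```

Because both steps are views, writing through a, b, or rec modifies the same memory.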
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 7:25 PM, Aronne Merrelli aronne.merre...@gmail.com wrote:

In general this is a good suggestion - I was going to mention it earlier - but I think for this particular problem it is not better than the brute force and argmin() NumPy approach. On my laptop, the KDTree query is about a factor of 7 slower (ignoring the time cost to create the KDTree).

The cKDTree implementation is more than 4 times faster than the brute-force approach:

T = scipy.spatial.cKDTree(targets)

In [11]: %timeit foo1(element, targets)  # Brute force
1000 loops, best of 3: 385 us per loop

In [12]: %timeit foo2(element, T)  # cKDTree
10000 loops, best of 3: 83.5 us per loop

In [13]: 385/83.5
Out[13]: 4.610778443113772

A FLANN implementation should be even faster--perhaps by as much as another factor of two.

-Kevin
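[A self-contained version of the cKDTree comparison, with sizes following the thread. The timing numbers quoted above are machine-dependent; this only checks that the tree query and the brute-force argmin agree.]

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
targets = rng.uniform(0, 8, (5000, 7))
element = np.arange(1, 8)

# Build the tree once, then query; cKDTree.query returns (distance, index).
tree = cKDTree(targets)
dist, idx = tree.query(element)

# Brute force gives the same answer for an exact nearest-neighbor query.
d2 = ((targets - element) ** 2).sum(axis=1)
```

The tree-build cost is paid once, so it amortizes well when many queries hit the same target set.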
Re: [Numpy-discussion] sparse array data
On May 2, 2012, at 5:28 PM, Stéfan van der Walt wrote:

On Wed, May 2, 2012 at 3:20 PM, Francesc Alted franc...@continuum.io wrote: On 5/2/12 4:07 PM, Stéfan van der Walt wrote: Well, as the OP said, coo_matrix does not support dimensions larger than 2, right? That's just an implementation detail, I would imagine--I'm trying to figure out whether there is a new principle behind synthetic dimensions. By the way, David Cournapeau mentioned using b-trees for sparse ops a while ago; did you ever talk to him about those ideas?

The only new principle (which is not strictly new --- but new to NumPy's world-view) is using one (or more) fields of a structured array as synthetic dimensions which replace one or more of the raw table dimensions. Thus, you could create a view of a NumPy array (or a group of NumPy arrays) where one or more dimensions is replaced with these sparse dimensions. This is a fully general way to handle a mixture of sparse and dense structures in one general array interface. However, you lose the O(1) lookup, as now you must search for the non-zero items in order to implement algorithms (indexes are critical, and Francesc has some nice indexes in PyTables). A group-by operation can be replaced by an operation on a sparse dimension where you have mapped attributes to one or more dimensions in the underlying array. coo_matrix is just a special case of this more general idea. If you add the ability to compress attributes, then you get csr, csc, and various other forms of matrices as well.

More to come. If you are interested in this sort of thing, please let me know.

-Travis
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 4:26 PM, Moroney, Catherine M (388D) catherine.m.moro...@jpl.nasa.gov wrote: Using structured arrays is making my code complex when I try to call the vectorized function. If I stick to the original record arrays, what's the best way of initializing b from a without doing a row-by-row copy?

What does your original data look like? It seems like `a` is already what you need after the reshape?

Stéfan
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs jac...@bioinformed.com bioinfor...@gmail.com wrote: A FLANN implementation should be even faster--perhaps by as much as another factor of two.

I guess it depends on whether you care about the "Approximate" in "Fast Library for Approximate Nearest Neighbors".

Stéfan
Re: [Numpy-discussion] record arrays initialization
On Wednesday, May 2, 2012, Stéfan van der Walt wrote: On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs jac...@bioinformed.com bioinfor...@gmail.com wrote: A FLANN implementation should be even faster--perhaps by as much as another factor of two. I guess it depends on whether you care about the "Approximate" in "Fast Library for Approximate Nearest Neighbors". Stéfan

This is why I love following these lists! I don't think I ever would have come across this method on my own. Nifty!

Ben Root
Re: [Numpy-discussion] record arrays initialization
On Wed, May 2, 2012 at 6:46 PM, Kevin Jacobs jac...@bioinformed.com bioinfor...@gmail.com wrote: On Wed, May 2, 2012 at 7:25 PM, Aronne Merrelli aronne.merre...@gmail.com wrote: In general this is a good suggestion - I was going to mention it earlier - but I think for this particular problem it is not better than the brute force and argmin() NumPy approach. On my laptop, the KDTree query is about a factor of 7 slower (ignoring the time cost to create the KDTree). The cKDTree implementation is more than 4 times faster than the brute-force approach.

Wow, not sure how I missed that! It even seems to scale better than linear (some of that 83 us is call overhead, I guess):

In [35]: %timeit foo2(element, T)
10000 loops, best of 3: 115 us per loop

In [36]: elements = np.random.uniform(0, 8, [128, 512, 7])

In [37]: %timeit foo2(elements.reshape((128*512, 7)), T)
1 loops, best of 3: 2.66 s per loop

So only 2.7 seconds to search the whole set. Not bad!

Cheers,
Aronne
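[Aronne's batch query in self-contained form: cKDTree.query accepts a whole array of query points, so the (128, 512, 7) block is searched in one call after flattening. Sizes follow the thread; absolute timings will vary by machine and SciPy version.]

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
targets = rng.uniform(0, 8, (5000, 7))
tree = cKDTree(targets)

# Flatten the (128, 512, 7) queries to (128*512, 7), query in one call,
# then restore the (128, 512) shape on both results.
elements = rng.uniform(0, 8, (128, 512, 7))
dists, idxs = tree.query(elements.reshape(-1, 7))
dists = dists.reshape(128, 512)
idxs = idxs.reshape(128, 512)   # best-match target index per (i, j)
```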
Re: [Numpy-discussion] sparse array data
On Wed, May 2, 2012 at 6:25 PM, Travis Oliphant tra...@continuum.io wrote: The only new principle (which is not strictly new --- but new to NumPy's world-view) is using one (or more) fields of a structured array as synthetic dimensions which replace one or more of the raw table dimensions.

Ah, thanks--that's the detail I was missing. I wonder if the contiguity requirement will hamper us here, though. E.g., I could imagine that some tree structure might be more suitable for storing and organizing indices, and for large arrays we wouldn't like to make a copy for each operation. I guess we can't wait for discontiguous arrays to come along, though :)

More to come. If you are interested in this sort of thing, please let me know.

Definitely--if we can optimize this machinery it will be beneficial to scipy.sparse as well.

Stéfan
Re: [Numpy-discussion] sparse array data
On Wed, May 2, 2012 at 3:20 PM, Nathaniel Smith n...@pobox.com wrote:

On Wed, May 2, 2012 at 9:53 PM, Francesc Alted franc...@continuum.io wrote: [...] Curiously enough, I have recently been discussing with Travis O. about how to represent sparse matrices with complete generality. One of the possibilities is to use what Travis calls "synthetic dimensions". [...]

This coordinate format is also what's used by the MATLAB Tensor Toolbox. They have a paper justifying this choice and describing some tricks for how to work with them: http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1 (Spoiler: you use a lot of sort operations. Conveniently, timsort appears to be perfectly adapted to their algorithmic requirements.)

Timsort probably isn't the best choice here: it is optimized for Python lists of Python objects, where there is at least one level of indirection and compares are expensive, even more expensive for compound objects. If the coordinates are stored in NumPy structured arrays, an ordinary sort is likely to be faster.
Lexsort might even be faster, as it could access aligned integer data and not have to move lists of indexes around. I'm not sure why one would make up a new term like "synthetic dimensions" though, it's just standard coordinate format...

Though, for the original poster, depending on their exact problem, they might be better off just using a list or object ndarray of scipy.sparse matrices. Or some coordinate arrays like above, plus add.reduceat for the sums they mentioned.

Chuck
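[A sketch of the lexsort-plus-reduceat approach Chuck describes, on made-up coordinate data: sort the COO entries by the leading coordinates, then let np.add.reduceat sum each run of entries that share the same (dim0, dim1) pair, i.e. sum over the last dimension.]

```python
import numpy as np

# Made-up COO entries for a 3-D sparse array, deliberately unsorted.
dim0 = np.array([0, 20, 0, 0])
dim1 = np.array([0, 5, 10, 0])
dim2 = np.array([0, 202, 100, 5])
vals = np.array([1.0, 4.0, 3.0, 2.0])

# lexsort: the LAST key is the primary one, so this sorts by dim0, then dim1.
order = np.lexsort((dim1, dim0))
d0, d1, v = dim0[order], dim1[order], vals[order]

# Start index of each run of equal (dim0, dim1) pairs in the sorted data.
starts = np.flatnonzero(np.r_[True, (d0[1:] != d0[:-1]) | (d1[1:] != d1[:-1])])

# Sum each run: group_sums[k] is the sum over dim2 at pair (d0[starts[k]],
# d1[starts[k]]).
group_sums = np.add.reduceat(v, starts)
```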
Re: [Numpy-discussion] sparse array data
On May 2, 2012, at 10:03 PM, Stéfan van der Walt wrote:

[...] I wonder if the contiguity requirement will hamper us here, though. E.g., I could imagine that some tree structure might be more suitable for storing and organizing indices, and for large arrays we wouldn't like to make a copy for each operation. I guess we can't wait for discontiguous arrays to come along, though :)

Actually, it's better to keep the actual data together as much as possible, I think, and simulate a tree structure with a layer on top --- i.e. an index. Different algorithms will prefer different orderings of the underlying data, just as today different algorithms prefer different striding patterns on the standard, strided view of a dense array.

-Travis