Re: [Numpy-discussion] [AstroPy] Rotating and Transforming Vectors--Flight Path of a Celestial Body
2009/12/18 Wayne Watson sierra_mtnv...@sbcglobal.net: It's starting to come back to me. I found a few old graphics books that get into transformation matrices and such. Yes, earth centered. I ground out some code with geometry and trig that at least gets the first point on the path right. I think I can probably apply a rotation around the x-axis several times with a 1/2 degree rotation for each step towards the north to get the job done. I'm still fairly fresh with Python, and found a little bit of info in a Tentative numpy tutorial. I found this on getting started with matrices: from numpy import matrix Apparently matrix is a class in numpy, and there are many others, linalg I think is one. How does one find all the classes, and are there rules that keep them apart. It was tempting to add import numpy in addition to the above, but how is what is in numpy different that the classes? Is numpy solely class driven? That is, if one does a from as above to get all the classes, does it give all the capabilities that just using import numpy does? Many things in python are classes; a class is a way of attaching relevant functions to a collection of data (more precisely, a class is a *type*, defining the interpretation of data; usually they also carry a collection of functions to operate on that data). So the central feature of numpy is a class, ndarray, that represents a collection of values of homogeneous type. This may be one, two, or many-dimensional, and there are various operations, including linear algebra, on them available in numpy. The matrix class is a special kind of ndarray in which a number of modifications have been made. In particular, the * operator no longer does element-wise operations, it does matrix multiplication. There are also various rules designed to ensure that matrix objects are always two-dimensional. I avoid matrix objects like the plague, but some people find them useful. numpy.linalg is an entirely different beast. It is a *module*, a collection of functions (and potentially objects and classes). It is like sys or os: you import it and the functions, objects and classes it contains become available. This is a basic feature of python. What is unusual (but not unique) is that rather than having to explicitly import it like: import numpy import numpy.linalg numpy.linalg.svd(numpy.ones((3,2))) numpy automatically imports it for you, every time. This is done for historical reasons and won't change, but is a wart. For your purposes, I recommend simply using numpy arrays - three-element arrays for vectors, three-by-three for matrices - and using the linear algebra functions numpy provides to act on them. For example, dot does matrix-matrix, matrix-vector, and vector-vector multiplication. Anne P.S. you can usually write out a rotation explicitly, e.g. as [[cos(t), sin(t), 0], [-sin(t), cos(t), 0], [0, 0, 1]] but if you want a more general one I believe there's a clever way to make it using two reflections. -A ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
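A minimal sketch of the array-plus-dot approach described above (the starting vector, step size, and step count here are made up for illustration, not taken from Wayne's problem):

import numpy as np

theta = np.radians(0.5)                        # half-degree step about the x-axis
Rx = np.array([[1.0, 0.0,            0.0],
               [0.0, np.cos(theta), -np.sin(theta)],
               [0.0, np.sin(theta),  np.cos(theta)]])
v = np.array([0.0, 1.0, 0.0])                  # example starting direction
path = [v]
for _ in range(180):                           # 90 degrees in half-degree steps
    v = np.dot(Rx, v)                          # matrix-vector product
    path.append(v)
path = np.array(path)                          # shape (181, 3)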
Re: [Numpy-discussion] no ordinary Bessel functions?
2009/12/14 Dr. Phillip M. Feldman pfeld...@verizon.net: When I issue the command np.lookfor('bessel') I get the following: Search results for 'bessel' --- numpy.i0 Modified Bessel function of the first kind, order 0. numpy.kaiser Return the Kaiser window. numpy.random.vonmises Draw samples from a von Mises distribution. I assume that there is an ordinary (unmodified) Bessel function in NumPy, but have not been able to figure out how to access it. Also, I need to operate sometimes on scalars, and sometimes on arrays. For operations on scalars, are the NumPy Bessel functions significantly slower than the SciPy Bessel functions? I am afraid this is one of the historical warts in numpy. The proper place for special functions is in scipy.special, which does indeed have various flavours of Bessel functions. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
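For concreteness, a small sketch (assuming scipy is installed; jv is the Bessel function of the first kind of real order, j0 the order-0 convenience function):

import numpy as np
from scipy import special

x = np.linspace(0.0, 10.0, 101)
y = special.jv(0, x)     # ordinary (unmodified) Bessel function J_0 on an array
y0 = special.j0(2.5)     # scalar arguments work too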
Re: [Numpy-discussion] Repeated dot products
2009/12/12 T J tjhn...@gmail.com: Hi, Suppose I have an array of shape: (n, k, k). In this case, I have n k-by-k matrices. My goal is to compute the product of a (potentially large) user-specified selection (with replacement) of these matrices. For example, x = [0,1,2,1,3,3,2,1,3,2,1,5,3,2,3,5,2,5,3,2,1,3,5,6] says that I want to take the 0th matrix and dot it with the 1st matrix and dot that product with the 2nd matrix and dot that product with the 1st matrix again and so on... Essentially, I am looking for efficient ways of doing this. It seems like what I *need* is for dot to be a ufunc with a reduce() operator. Then I would construct an array of the matrices, as specified by the input x. For now, I am using a python loop and this is unbearable. prod = np.eye(k) for i in x: ... prod = dot(prod, matrices[i]) ... Is there a better way? You are right that numpy/scipy should have a matrix product ufunc. In fact it should have ufunc versions of all the linear algebra tools. It even has the infrastructure (generalized ufuncs) to support such a thing in a very general way. Sadly this infrastructure is basically unused. Your best workaround depends on the sizes of the various objects involved. If k is large enough, then a k by k by k matrix multiply will be slow enough that it doesn't really matter what else you do. If x is much longer than n, you will want to avoid constructing a len(x) by k by k array. If n is really small and x really large, you might even win by some massively clever Knuthian operation that found repeated substrings in x and evaluated each corresponding matrix product only once. If on the other hand len(x) is much shorter than n, you could perhaps benefit from forming an intermediate len(x) by k by k array. This array could then be used in a vectorized reduce operation, if we had one. Normally, one can work around the lack of a vectorized matrix multiplication by forming an (n*k) by k matrix and multiplying it, but I don't think this will help any with reduce. If k is small, you can multiply two k by k matrices by producing the k by k by k elementwise product and then adding along the middle axis, but I don't think this will work well with reduction either. So I have just two practical solutions, really: (a) use python reduce() and the matrix product, possibly taking advantage of the lack of need to produce an intermediate len(x) by k by k matrix, and live with python in that part of the loop, or (b) use the fact that in recent versions of numpy there's a quasi-undocumented ufuncized matrix multiply built as a test of the generalized ufunc mechanism. I have no idea whether it supports reduce() (and if it doesn't adding it may be an unpleasant experience). Good luck, Anne P.S. if you come up with a good implementation of the repeated substring approach I'd love to hear it! -A ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
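A sketch of option (a), with made-up sizes; functools.reduce keeps only one k-by-k intermediate at a time:

import numpy as np
from functools import reduce

k, n = 4, 10
matrices = np.random.randn(n, k, k)
x = [0, 1, 2, 1, 3, 3, 2, 1]
prod = reduce(np.dot, (matrices[i] for i in x), np.eye(k))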
Re: [Numpy-discussion] Assigning complex values to a real array
2009/12/9 Dr. Phillip M. Feldman pfeld...@verizon.net: Pauli Virtanen-3 wrote: Nevertheless, I can't really regard dropping the imaginary part a significant issue. I am amazed that anyone could say this. For anyone who works with Fourier transforms, or with electrical circuits, or with electromagnetic waves, dropping the imaginary part is a huge issue because we get answers that are totally wrong. I agree that dropping the imaginary part is a wart. But it is one that is not very hard to learn to live with. I say this as someone who has been burned by it while using Fourier analysis to work with astronomical data. When I recently tried to validate a code, the answers were wrong, and it took two full days to track down the cause. I am now forced to reconsider carefully whether Python/NumPy is a suitable platform for serious scientific computing. While I find the current numpy complex-real conversion annoying, I have to say, this kind of rhetoric does not benefit your cause. It sounds childish and manipulative, and makes even people who agree in principle want to tell you to go ahead and use MATLAB and stop pestering us. We are not here to sell you on numpy; if you hate it, don't use it. We are here because *we* use it, warts and all, and we want to discuss interesting topics related to numpy. That you would have implemented it differently is not very interesting if you are not even willing to understand why it is the way it is and what a change would cost, let alone propose a workable way to improve. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array
2009/12/2 Keith Goodman kwgood...@gmail.com: On Wed, Dec 2, 2009 at 5:09 PM, Neal Becker ndbeck...@gmail.com wrote: David Warde-Farley wrote: On 2-Dec-09, at 6:55 PM, Howard Chong wrote: def myFindMaxA(myList): """implement finding maximum value with for loop iteration""" maxIndex=0 maxVal=myList[0] for index, item in enumerate(myList): if item[0] > maxVal: maxVal=item[0] maxIndex=index return [maxIndex, maxVal] My question is: how can I make the latter version run faster? I think the answer is that I have to do the iteration in C. def find_biggest_n(myarray, n): ind = np.argsort(myarray) return ind[-n:], myarray[ind[-n:]] David Not bad, although I wonder whether a partial sort could be faster. I'm doing a lot of sorting right now. I only need to sort the lowest 30% of values in a 1d array (about 250k elements), the rest I don't need to sort. How do I do a partial sort? Algorithmically, if you're doing a quicksort, you just don't sort one side of a partition when it's outside the range you want sorted (which could even be data-dependent, I suppose, as well as number-of-items-dependent). This is useful for sorting out only the extreme elements, as well as finding quantiles (and those elements above/below them). Unfortunately I'm not aware of any implementation, useful though it might be. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
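To illustrate the idea, a pure-python sketch (so it will not beat numpy's C argsort in practice; the fraction and names are made up):

import numpy as np

def sort_smallest_fraction(a, frac=0.3):
    # quicksort, but never recurse into a partition that lies entirely
    # beyond the first `frac` of the array -- those elements stay unsorted
    a = np.array(a)                      # work on a copy
    k = int(len(a) * frac)               # only a[:k] needs to end up sorted

    def qsort(lo, hi):
        if hi - lo <= 1 or lo >= k:      # trivial, or past the wanted range
            return
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi - 1
        while i <= j:                    # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        qsort(lo, j + 1)
        qsort(i, hi)

    qsort(0, len(a))
    return a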
Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array
2009/12/2 David Warde-Farley d...@cs.toronto.edu: On 2-Dec-09, at 8:09 PM, Neal Becker wrote: Not bad, although I wonder whether a partial sort could be faster. Probably (if the array is large) but depending on n, not if it's in Python. Ideal problem for Cython, though. How is Cython support for generic types these days? One wouldn't want to have to write separate versions for each dtype... Anne David ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Find the N maximum values and corresponding indexes in an array
2009/12/2 Keith Goodman kwgood...@gmail.com: On Wed, Dec 2, 2009 at 5:27 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/12/2 Keith Goodman kwgood...@gmail.com: On Wed, Dec 2, 2009 at 5:09 PM, Neal Becker ndbeck...@gmail.com wrote: David Warde-Farley wrote: On 2-Dec-09, at 6:55 PM, Howard Chong wrote: def myFindMaxA(myList): implement finding maximum value with for loop iteration maxIndex=0 maxVal=myList[0] for index, item in enumerate(myList): if item[0]maxVal: maxVal=item[0] maxIndex=index return [maxIndex, maxVal] My question is: how can I make the latter version run faster? I think the answer is that I have to do the iteration in C. def find_biggest_n(myarray, n): ind = np.argsort(myarray) return ind[-n:], myarray[ind[-n:]] David Not bad, although I wonder whether a partial sort could be faster. I'm doing a lot of sorting right now. I only need to sort the lowest 30% of values in a 1d array (about 250k elements), the rest I don't need to sort. How do I do a partial sort? Algorithmically, if you're doing a quicksort, you just don't sort one side of a partition when it's outside the range you want sorted (which could even be data-dependent, I suppose, as well as number-of-items-dependent). This is useful for sorting out only the extreme elements, as well as finding quantiles (and those elements above/below them). Unfortunately I'm not aware of any implementation, useful though it might be. Oh, I thought he meant there was a numpy function for partial sorting. What kind of speed up for my problem (sort lower 30% of a 1d array with 250k elements) could I expect if I paid someone to write it in cython? Twice as fast? I'm actually doing x.argsort(), if that matters. Hard to say exactly, but I'd say at best a factor of two. You can get a rough best-case value by doing an argmin followed by an argsort of the first 30%. In reality things will be a bit worse because you won't immediately be able to discard the whole rest of the array, and because your argsort will be accessing data spread over a larger swath of memory (so worse cache performance). But if this doesn't provide a useful speedup, it's very unlikely partial sorting will be of any use to you. Medians and other quantiles, or partitioning tasks, are where it really shines. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
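A sketch of that rough best-case estimate (sizes as in the question; the exact timings will of course vary):

import numpy as np
import timeit

x = np.random.randn(250000)
k = int(0.3 * len(x))

t_full = timeit.timeit(lambda: np.argsort(x), number=10)
# best case for a partial argsort: one linear scan of the whole array
# plus a full argsort of an array only 30% as long
t_best = timeit.timeit(lambda: (np.argmin(x), np.argsort(x[:k])), number=10)
print(t_full, t_best)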
Re: [Numpy-discussion] convert strides/shape/offset into nd index?
2009/11/30 James Bergstra bergs...@iro.umontreal.ca: Your question involves a few concepts: - an integer vector describing the position of an element - the logical shape (another int vector) - the physical strides (another int vector) Ignoring the case of negative offsets, a physical offset is the inner product of the physical strides with the position vector. In these terms, you are asking how to solve the inner-product equation for the position vector. There can be many possible solutions (like, if there is a stride of 1, then you can make that dimension account for the entire offset. This is often not the solution you want.). For valid ndarrays though, there is at most one solution though with the property that every position element is less than the shape. You will also need to take into account that for certain stride vectors, there is no way to get certain offsets. Imagine all the strides were even, and you needed to get at an odd offset... it would be impossible. It would even be impossible if there were a dimension with stride 1 but it had shape of 1 too. I can't think of an algorithm off the top of my head that would do this in a quick and elegant way. Not to be a downer, but this problem is technically NP-complete. The so-called knapsack problem is to find a subset of a collection of numbers that adds up to the specified number, and it is NP-complete. Unfortunately, it is exactly what you need to do to find the indices to a particular memory location in an array of shape (2,2,...,2). What that means in practice is that either you have to allow potentially very slow algorithms (though you know that there will never be more than 32 different values in the knapsack, which might or might not be enough to keep things tractable) or use heuristic algorithms that don't always work. There are probably fairly good heuristics, particularly if the array elements are all at distinct memory locations (arrays with overlapping elements can arise from broadcasting and other slightly more arcane operations). Anne James On Sun, Nov 29, 2009 at 10:36 AM, Zachary Pincus zachary.pin...@yale.edu wrote: Hi all, I'm curious as to what the most straightforward way is to convert an offset into a memory buffer representing an arbitrarily strided array into the nd index into that array. (Let's assume for simplicity that each element is one byte...) Does sorting the strides from largest to smallest and then using integer division and mod (in the obvious way) always work? (I can't seem to find a counterexample, but I am not used to thinking too deeply about bizarrely-strided configurations.) Is there a method that doesn't involve sorting? Thanks, Zach ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- http://www-etud.iro.umontreal.ca/~bergstrj ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
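For the common, non-pathological layouts, the divide-and-mod heuristic Zachary describes looks something like this sketch (strides given in elements and assumed positive; per the discussion above it is not guaranteed to work for arbitrary stride vectors):

import numpy as np

def offset_to_index(offset, shape, strides):
    index = [0] * len(shape)
    # visit axes from largest stride to smallest
    for axis in sorted(range(len(shape)), key=lambda i: strides[i], reverse=True):
        index[axis], offset = divmod(offset, strides[axis])
        if index[axis] >= shape[axis]:
            raise ValueError("offset lies outside the array")
    if offset != 0:
        raise ValueError("offset does not land on an element")
    return tuple(index)

a = np.zeros((3, 4, 5))
elem_strides = [s // a.itemsize for s in a.strides]   # (20, 5, 1) for C order
print(offset_to_index(17, a.shape, elem_strides))     # -> (0, 3, 2)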
Re: [Numpy-discussion] non-standard standard deviation
2009/11/29 Dr. Phillip M. Feldman pfeld...@verizon.net: All of the statistical packages that I am currently using and have used in the past (Matlab, Minitab, R, S-plus) calculate standard deviation using the sqrt(1/(n-1)) normalization, which gives a result that is unbiased when sampling from a normally-distributed population. NumPy uses the sqrt(1/n) normalization. I'm currently using the following code to calculate standard deviations, but would much prefer if this could be fixed in NumPy itself: This issue was the subject of lengthy discussions on the mailing list, the upshot of which is that in current versions of scipy, std and var take an optional argument ddof, into which you can supply 1 to get the normalization you want. Anne def mystd(x=numpy.array([]), axis=None): This function calculates the standard deviation of the input using the definition of standard deviation that gives an unbiased result for samples from a normally-distributed population. -- View this message in context: http://old.nabble.com/non-standard-standard-deviation-tp26566808p26566808.html Sent from the Numpy-discussion mailing list archive at Nabble.com. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
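For example:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x))           # default ddof=0: the sqrt(1/n) normalization
print(np.std(x, ddof=1))   # sqrt(1/(n-1)), as in Matlab/Minitab/R/S-plus
print(np.var(x, ddof=1))   # var takes the same argument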
Re: [Numpy-discussion] Computing Simple Statistics When Only the Frequency Distribution is Known
2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net: Anne Archibald wrote: 2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net: I was only illustrating a way that I would not consider, since the hardware has already created the pdf. I've already coded it pretty much as you have suggested. As I think I mentioned above, I'm a bit surprised numpy doesn't provide the code you suggest as part of some function. CalcSimplefromPDF(xvalues=mydatarray, avg=true, minmax=true, ...). Feel free to submit an implementation to numpy's issue tracker. I suggest modifying mean, std, and var (at least) so that, like average, they take an array of weights. How would I do that? Obtain a copy of recent numpy source code - a .tar file from the website, or using SVN, or git. Then add the feature plus some tests, confirm that the tests pass, and post a request and your patch to the bug tracker. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Computing Simple Statistics When Only the Frequency Distribution is Known
2009/11/28 Wayne Watson sierra_mtnv...@sbcglobal.net: I was only illustrating a way that I would not consider, since the hardware has already created the pdf. I've already coded it pretty much as you have suggested. As I think I mentioned above, I'm a bit surprised numpy doesn't provide the code you suggest as part of some function. CalcSimplefromPDF(xvalues=mydatarray, avg=true, minmax=true, ...). Feel free to submit an implementation to numpy's issue tracker. I suggest modifying mean, std, and var (at least) so that, like average, they take an array of weights. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
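In the meantime, the weighted versions are only a few lines; a sketch (the names values/counts are illustrative, standing for the bin values and the frequencies the hardware reports):

import numpy as np

values = np.arange(256)                         # bin values
counts = np.random.poisson(10.0, size=256)      # stand-in for the measured histogram

mean = np.average(values, weights=counts)
var = np.average((values - mean) ** 2, weights=counts)   # 1/n-style normalization
std = np.sqrt(var)
vmin = values[counts > 0].min()
vmax = values[counts > 0].max()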
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
2009/11/27 Christopher Barker chris.bar...@noaa.gov: The point is that I don't think we can just decide to use Unicode or Bytes in all places where PyString was used earlier. Agreed. I only half agree. It seems to me that for almost all situations where PyString was used, the right data type is a python3 string (which is unicode). I realize there may be some few cases where it is appropriate to use bytes, but I think there needs to be a compelling reason for each one. In a way, unicode strings are a bit like arrays: they have an encoding associated with them (like a dtype in numpy). You can represent a given bit of text in multiple different arrangements of bytes, but they are all supposed to mean the same thing and, if you know the encoding, you can convert between them. This is kind of like how one can represent 5 in any of many dtypes: uint8, int16, int32, float32, float64, etc. Not every value represented by one dtype can be converted to all other dtypes, but many can. Just like encodings. This is incorrect. Unicode objects do not have default encodings or multiple internal representations (within a single python interpreter, at least). Unicode objects use 2- or 4-byte representations internally, but this is almost invisible to the user. Encodings only become relevant when you want to convert a unicode object to a byte stream. It is usually an error to store text in a byte stream (for it to make sense you must provide some mechanism to specify the encoding). Anyway, all this brings me to think about the use of strings in numpy in this way: if it is meant to be a human-readable piece of text, it should be a unicode object. If not, then it is bytes. So: fromstring and the like should, of course, work with bytes (though maybe buffers really...) I think if you're going to call it fromstring, it should convert from strings (i.e. unicode strings). But really, I think it makes more sense to rename it frombytes() and have it convert bytes objects. One could then have def fromstring(s, encoding='utf-8'): return frombytes(s.encode(encoding)) as a shortcut. Maybe ASCII makes more sense as a default encoding. But really, think about where the user's going to get the string: most of the time it's coming from a disk file or a network stream, so it will be a byte string already, so they should use frombytes. To summarize the use cases I've run across so far: 1) For 'S' dtype, I believe we use Bytes for the raw data and the interface. I don't think so here. 'S' is usually used to store human-readable strings, I'd certainly expect to be able to do: s_array = np.array(['this', 'that'], dtype='S10') And I'd expect it to work with non-literals that were unicode strings, i.e. human readable text. In fact, it's pretty rare that I'd ever want bytes here. So I'd see 'S' mapped to 'U' here. +1 Francesc Alted wrote: the next should still work: In [2]: s = np.array(['asa'], dtype='S10') In [3]: s[0] Out[3]: 'asa' # will become b'asa' in Python 3 I don't like that -- I put in a string, and get a bytes object back? I agree. In [4]: s.dtype.itemsize Out[4]: 10 # still 1-byte per element But what if the strings passed in aren't representable in one byte per character? Do we define 'S' as only supporting ANSI strings? what encoding? Itemsize will change. That's fine. 3) Format strings a = array([], dtype=b'i4') I don't think it makes sense to handle format strings in Unicode internally -- they should always be coerced to bytes.
This should be fine -- we control what is a valid format string, and thus they can always be ASCII-safe. I have to disagree. Why should we force the user to use bytes? The format strings are just that, strings, and we should be able to supply python strings to them. Keep in mind that coercing strings to bytes requires extra information, namely the encoding. If you want to emulate python2's value-dependent coercion - raise an exception only if non-ASCII is present - keep in mind that python3 is specifically removing that behaviour because of the problems it caused. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
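A small illustration of why that coercion needs an encoding (Python 3 syntax; the example text is made up):

s = "Ångström"               # text: a str (unicode) object
b1 = s.encode("utf-8")        # explicit encoding -> one byte sequence
b2 = s.encode("latin-1")      # a different encoding -> a different byte sequence
assert b1 != b2
assert b1.decode("utf-8") == s
# bytes(s) with no encoding argument raises TypeError in Python 3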
Re: [Numpy-discussion] Adding the new polynomial/chebyshev modules.
2009/11/16 Christopher Barker chris.bar...@noaa.gov: Charles R Harris wrote: I would like some advise on the best way to add the new functions. I've added a new package polynomial, and that package contains four new modules: chebyshev, polynomial, polytemplate, polyutils. This seems to belong more in scipy than numpy, but I'll leave that to others to decide. whether or not to include all of the functions in these packages in the __init__.py, or to just import the modules. Are any of them compiled code? I've been very frustrated when I can't use some pure python stuff in scipy because broken compiled fortran extensions are getting imported that I don't even need. If that isn't an issue, and the polynomial package would end up with only a handful of names, then I say import them all. Another way to ask this: would there by ANY names in the polynomial package if you don't import the modules? If there is compiled code, the import could fail gracefully, and then you could still pull it all in. OTOH, what this does is bring stuff into memory unnecessarily, and also brings it into stand-alone bundles (py2exe, py2app, etc). So if these modules are not small, then it's probably better to have to import them explicitly. Also -- do you foresee many more polynomial types in the future? I know I'd like to see Hermite. I have kind of gone silent on this issue, in part because I have been busy with other things. I think that Charles Harris' framework for working with polynomials is perfectly sufficient for adding Chebyshev polynomials, and since they're finished and working and fill a concrete need, they should probably go into numpy or scipy as is. But if you want to start introducing other types of polynomials - Hermite, Lagrange interpolation based, Bernstein based, or other - I think we would need to revive the discussion about how to unify all these different types. Anne -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
2009/11/16 Christopher Barker chris.bar...@noaa.gov: Anne Archibald wrote: 2009/11/13 Christopher Barker chris.bar...@noaa.gov: Wow! great -- you sounded interested, but I had no idea you'd run out and do it! thanks! we'll check it out. well, it turns out the Python version is unacceptably slow. However, we found we could use ckdtree, and use: tree.query(points, 2, # Number of points to return per point given. distance_upper_bound = distance_tolerance, ) This gives us a set of pairs of points closer than our distance tolerance -- it includes duplicates, of course, but even when we filter those in pure python, it's is nice and speedy. It looks like it would be pretty easy to add this to the Cython code, but it's working now for us, so... You probably noticed this, but I should remind you that if a point has more than one nearby neighbor, this may not give you what you wanted. In particular, if there are three points all within the fixed distance from each other, you may not find all three pairs this way. Anne Thanks again, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
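A sketch of that workaround, for reference (the array sizes and radius here are made up):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)
r = 0.01
tree = cKDTree(points)
dist, idx = tree.query(points, k=2, distance_upper_bound=r)
# column 0 of each row is the point itself; column 1 is its nearest other
# point, with dist == inf (and idx == len(points)) where nothing lies within r
pairs = set()
for i in np.nonzero(np.isfinite(dist[:, 1]))[0]:
    j = idx[i, 1]
    pairs.add((min(i, j), max(i, j)))       # drop the (j, i) duplicates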
Re: [Numpy-discussion] Adding the new polynomial/chebyshev modules.
2009/11/16 Robert Kern robert.k...@gmail.com: On Mon, Nov 16, 2009 at 18:05, Christopher Barker chris.bar...@noaa.gov wrote: Charles R Harris wrote: That's what I ended up doing. You still need to do import numpy.polynomial to get to them, they aren't automatically imported into the numpy namespace. good start. This brings up a semi-off-topic question: Is there a way to avoid importing everything when importing a module deep in a big package? The package authors need to keep the __init__.py files clear. There is nothing you can do as a user. The reason numpy and scipy don't do this is largely historical - Numeric had a nearly flat namespace, and imported all the submodules in any case, and numpy is trying to remain somewhat compatible; scipy originally tried to reexport all the numpy symbols, for some reason, and there too it is maintaining compatibility. Since spatial is new, though, it should be pretty good about not inflating your imports unnecessarily. It does, of course, import various things itself, which may balloon out the imports. Anne -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
2009/11/13 Christopher Barker chris.bar...@noaa.gov: Anne Archibald wrote: 2009/11/10 Christopher Barker chris.bar...@noaa.gov: I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one-another (regular old Euclidean norm). This is now implemented in SVN. Wow! great -- you sounded interested, but I had no idea you'd run out and do it! thanks! we'll check it out. No problem, it was a fairly minor modification of the two-tree code. I (tentatively?) used a set to store the collection of pairs, because my tree traversal is not smart enough to reliably uniquify the pairs without using sets. With more time and energy, I'm sure the algorithm could be improved to avoid using sets (both internally and on return), but I think that's something to save for the Cython version. I agree -- what's wrong with using a set? It should be possible to arrange the tree traversal algorithm so that each pair of subtrees is automatically traversed only once - akin to for i in range(n): for j in range(i): This would avoid having to build and modify a set every time you traverse the tree (sets are fairly cheap hash tables, so the cost is probably minor compared to other python overhead), and it would also naturally allow the result to be a list/inflatable array (which would save memory, if nothing else). But it would also require carefully thinking through the tree traversal algorithm. Thanks, we'll let you know how it works for us. Good luck! Anne -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
2009/11/12 Lou Pecora lou_boog2...@yahoo.com: - Original Message From: Christopher Barker chris.bar...@noaa.gov To: Discussion of Numerical Python numpy-discussion@scipy.org Sent: Thu, November 12, 2009 12:37:37 PM Subject: Re: [Numpy-discussion] finding close together points. Lou Pecora wrote: a KD tree for 2D nearest neighbor seems like over kill. You might want to try the simple approach of using boxes of points to narrow things down by sorting on the first component. yeah, we'll probably do something like that if we have to write the code ourselves. At the moment, we're using geohash: http://pypi.python.org/pypi/Geohash/ (this is for points on the earth) and it's working OK. I was just hoping kdtree would work out of the box! where for a static data set it can match KD trees in speed Why does it have to be static -- it doesn't look hard to insert/remove points. --- Lou answers -- It doesn't have to be static, but if I remember correctly, when you add points you have to resort or do a smart insert (which may require a lot of data shifting). Not hard, but the original paper claimed that if there are a lot of changes to the data set this becomes more expensive than the kd-tree. If you can get the original paper I mentioned, it will have more info. It's pretty clearly written and contains some nice comparisons to kd-trees for finding near neighbors. Bottom line is that you can still try it with changing data. See what the run times are like. I've used the approach for up to eight dimensions with about 10^5 data points. It worked nicely. I remember looking for a kd-tree library that would work out of the box (C or C++), but everything I found was not as simple as I would have liked. The slice searching was a nice solution. But maybe I just didn't know where to look or I'm not understanding some of the libraries I did find. Let us know how it goes. Just a quick comment: none of the KDTrees in scipy.spatial support any kind of modification right now, so if your data set changes you will need to rebuild them completely. Construction is fairly cheap, but it's very unlikely to compete with a data structure that supports modification algorithms. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
2009/11/11 Christopher Barker chris.bar...@noaa.gov: Anne Archibald wrote: 2009/11/10 Christopher Barker chris.bar...@noaa.gov: I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one-another (regular old Euclidean norm). This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't. darn. It's slow both because you're using the python implementation rather than the C, I noticed that. The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad. Can you give me a hint as to where to start? I have no idea how kd trees work. If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown. And certainly in this case, there would not be very many results, and lists are pretty fast. This is now implemented in SVN. I (tentatively?) used a set to store the collection of pairs, because my tree traversal is not smart enough to reliably uniquify the pairs without using sets. With more time and energy, I'm sure the algorithm could be improved to avoid using sets (both internally and on return), but I think that's something to save for the Cython version. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding close together points.
2009/11/10 Christopher Barker chris.bar...@noaa.gov: Hi all, I have a bunch of points in 2-d space, and I need to find out which pairs of points are within a certain distance of one-another (regular old Euclidean norm). This is an eminently reasonable thing to want, and KDTree should support it. Unfortunately it doesn't. scipy.spatial.KDTree.query_ball_tree() seems like it's built for this. However, I'm a bit confused. The first argument is a kdtree, but I'm calling it as a method of a kdtree -- I want to know which points in the tree I already have are closer that some r from each-other. If I call it as: tree.query_ball_tree(tree, r) I get a big list, that has all the points in it (some of them paired up with close neighbors.) It appears I'm getting the distances between all the points in the tree and itself, as though they were different trees. This is slow, takes a bunch of memory, and I then have to parse out the list to find the ones that are paired up. Is there a way to get just the close ones from the single tree? Unfortunately not at the moment. It's slow both because you're using the python implementation rather than the C, and because you're getting all pairs where pair includes pairing a point with itself (and also each distinct pair in both orders). The tree really should allow self-queries that don't return the point and itself. The one good thing about using the python implementation rather than the Cython one is that you can subclass it, providing a new method. There's still a certain amount of boilerplate code to write, but it shouldn't be too bad. If this is still too slow, I have no problem incorporating additional code into cKDTree; the only reason none of the ball queries are in there is because ball queries must return variable-size results, so you lose a certain amount of speed because you're forced to manipulate python objects. But if there are relatively few result points, this need not be much of a slowdown. Anne thanks, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
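Meanwhile, the filtering is straightforward if a bit wasteful; a sketch (sizes and radius made up):

import numpy as np
from scipy.spatial import KDTree

points = np.random.rand(200, 2)
r = 0.05
tree = KDTree(points)
neighbors = tree.query_ball_tree(tree, r)   # for each i, every j with |p_i - p_j| <= r
# keep each pair once: drop self-pairings and the (j, i) duplicates
pairs = [(i, j) for i, js in enumerate(neighbors) for j in js if j > i]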
Re: [Numpy-discussion] Use-case for np.choose
2009/11/8 josef.p...@gmail.com: On Sat, Nov 7, 2009 at 7:53 PM, David Goldsmith d.l.goldsm...@gmail.com wrote: Thanks, Anne. On Sat, Nov 7, 2009 at 1:32 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: snip Also, my experimenting suggests that the index array ('a', the first argument in the func. sig.) *must* have shape (choices.shape[-1],) - someone please let me know ASAP if this is not the case, and please furnish me w/ a counterexample because I was unable to generate one myself. It seems like a and each of the choices must have the same shape So in essence, at least as it presently functions, the shape of 'a' *defines* what the individual choices are within 'choices`, and if 'choices' can't be parsed into an integer number of such individual choices, that's when an exception is raised? (with the exception that choices can be scalars), but I would consider this a bug. OK, then we definitely need more people to opine on this, because, if the two don't match, our established policy is to document *desired* behavior, not extant behavior (and file a bug ticket). Really, a and all the choices should be broadcast to the same shape. Or maybe it doesn't make sense to broadcast a - it could be Thus begging the question: does anyone actually have an extant, specific use-case? valuable to know that the result is always exactly the same shape as a - but broadcasting all the choice arrays presents an important improvement of choose over fancy indexing. Then perhaps we need either another function, or a flag specifying which behavior this one should exhibit. There's a reason choose accepts a sequence of arrays as its second argument, rather than a higher-dimensional array. And that reason is probably supposed to be transparent above, but I've confused it by this point, so can you please reiterate it here, in so many words. :-) From looking at a few special cases, I think that full broadcasting rules apply. First a and all choice arrays are broadcast to the same shape, then the selection is done according to the elements of (the broadcasted) a. For broadcasting it doesn't matter whether they are scalars or 1d or 2d or a 2d single column array. (I haven't tried more than 2 dimensions) This must have changed since 1.2.1, since I get ValueError: too many dimensions for all those examples. (Though the 1.2.1 docs claim broadcasting.) The examples look a bit messy, but broadcasting is relatively straightforward. (I think, np.where is a bit easier to use because `a` is just a condition and doesn't require an index array) Well, a condition *is* an index array, it is just restricted to the values 0 and 1. But I agree it's not so natural to construct an index array with more values. I guess you could do something like: A = (C > -1) + (C > 0) + (C > 1) B = np.choose(A, (1, -C, C, 1)) But that's one of those magical numpy expressions that make the reader stop and ask how on earth does that work?
For further bewilderment you'd make it computed: A = np.sum(C[...,None] > transition_values, axis=-1) B = np.choose(A, [piece(C) for piece in piecewise_pieces]) Anne Josef np.choose(1, (3,4)) 4 np.choose(0, (3,4)) 3 np.choose(0, (np.arange(3)[:,None],np.arange(4),0)) array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]) np.choose(2, (np.arange(3)[:,None],np.arange(4),0)) array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]) np.choose(1, (np.arange(3)[:,None],np.arange(4),0)) array([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]) np.choose([1,2,0,0], (np.arange(3)[:,None],np.arange(4),0)) array([[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]]) np.choose(np.array([[1,2,0,0]]), (np.arange(3)[:,None],np.arange(4),0)) array([[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]]) np.choose(np.array([[1,2,0]]).T, (np.arange(3)[:,None],np.arange(4),0)) array([[0, 1, 2, 3], [0, 0, 0, 0], [2, 2, 2, 2]]) Thanks again, DG Anne Thanks, DG ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Use-case for np.choose
2009/11/8 David Goldsmith d.l.goldsm...@gmail.com: On Sat, Nov 7, 2009 at 11:59 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: So in essence, at least as it presently functions, the shape of 'a' *defines* what the individual choices are within 'choices`, and if 'choices' can't be parsed into an integer number of such individual choices, that's when an exception is raised? Um, I don't think so. Think of it this way: you provide np.choose with a selector array, a, and a list (not array!) [c0, c1, ..., cM] of choices. You construct an output array, say r, the same shape as a (no matter how many dimensions it has). Except that I haven't yet seen a working example with 'a' greater than 1-D, Josef's last two examples notwithstanding; or is that what you're saying is the bug. There's nothing magic about A being one-dimensional. C = np.random.randn(2,3,5) A = (C > -1).astype(int) + (C > 0).astype(int) + (C > 1).astype(int) R = np.choose(A, (-1, -C, C, 1)) Requiv = np.minimum(np.abs(C),1) or: def wedge(*functions): """Return a function whose value is the minimum of those of functions""" def wedgef(X): fXs = [f(X) for f in functions] A = np.argmin(fXs, axis=0) return np.choose(A,fXs) return wedgef so e.g. np.abs is -wedge(lambda X: X, lambda X: -X) This works no matter what shape of X the user supplies - so a wedged function can be somewhat ufunclike - by making A the same shape. The (i0, i1, ..., iN) element of the output array is obtained by looking at the (i0, i1, ..., iN) element of a, which should be an integer no larger than M; say j. Then r[i0, i1, ..., iN] = cj[i0, i1, ..., iN]. That is, each element of the selector array determines which of the choice arrays to pull the corresponding element from. That's pretty clear (thanks for doing my work for me). ;-), Yet, see above. For example, suppose that you are processing an array C, and have constructed a selector array A the same shape as C in which a value is 0, 1, or 2 depending on whether the C value is too small, okay, or too big respectively. Then you might do something like: C = np.choose(A, [-inf, C, inf]) This is something you might want to do no matter what shape A and C have. It's important not to require that the choices be an array of choices, because they often have quite different shapes (here, two are scalars) and it would be wasteful to broadcast them up to the same shape as C, just to stack them. OK, that's a pretty generic use-case, thanks; let me see if I understand it correctly: A is somehow created independently with a 0 everywhere C is too small, a 1 everywhere C is OK, and a 2 everywhere C is too big; then np.choose(A, [-inf, C, inf]) creates an array that is -inf everywhere C is too small, inf everywhere C is too large, and C otherwise (and since -inf and inf are scalars, this implies broadcasting of these is taking place). This is what you're asserting *should* be the behavior. So, unless there is disagreement about this (you yourself said the opposite viewpoint might rationally be held) np.choose definitely presently has a bug, namely, the index array can't be of arbitrary shape. There seems to be some disagreement between versions, but both Josef and I find that the index array *can* be arbitrary shape. In numpy 1.2.1 I find that all the choose items must be the same shape as it, which I think is a bug. What I suggested might be okay was if the index array was not broadcasted, so that the outputs always had exactly the same shape as the index array.
But upon reflection it's useful to be able to use a 1-d array to select rows from a set of matrices, so I now think that all of A and the elements of choose should be broadcast to the same shape. This seems to be what Josef observes in his version of numpy, so maybe there's nothing to do. Anne DG Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Use-case for np.choose
As Josef said, this is not correct. I think the key point of confusion is this: Do not pass choose two arrays. Pass it one array and a *list* of arrays. The fact that choices can be an array is a quirk we can't change, but you should think of the second argument as a list of arrays, possibly of different shapes. np.choose(np.ones((2,1,1)), [ np.ones((1,3,1)), np.ones((1,1,5)) ] ) Here the first argument is an array of choices. The second argument is a *list* - if you cast it to an array you'll get an error or nonsense - of arrays to choose from. The broadcasting ensures the first argument and *each element of the list* are the same shape. The only constraint on the number of arrays in the list is that it be larger than the largest value in a. If you try to make the second argument into a single array, for one thing you are throwing away useful generality by forcing each choice to be the same shape (and real shape, not zero-strided fake shape), and for another the broadcasting becomes very hard to understand. Anne 2009/11/8 David Goldsmith d.l.goldsm...@gmail.com: OK, now I'm trying to wrap my brain around broadcasting in choose when both `a` *and* `choices` need to be (non-trivially) broadcast in order to arrive at a common shape, e.g.: c=np.arange(4).reshape((2,1,2)) # shape is (2,1,2) a=np.eye(2, dtype=int) # shape is (2,2) np.choose(a,c) array([[2, 1], [0, 3]]) (Unfortunately, the implementation is in C, so I can't easily insert print statements to see intermediate results.) First, let me confirm that the above is indeed an example of what I think it is, i.e., both `a` and `choices` are broadcast in order for this to work, correct? (And if incorrect, how is one broadcast to the shape of the other?) Second, both are broadcast to shape (2,2,2), correct? But how, precisely, i.e., does c become [[[0, 1], [2, 3]], [[[0, 1], [0, 1]], [[0, 1], [2, 3]]] or [[2, 3], [2, 3]]] and same question for a? Then, once a is broadcast to a (2,2,2) shape, precisely how does it pick and choose from c to create a (2,2) result? For example, suppose a is broadcast to: [[[1, 0], [0, 1]], [[1, 0], [0, 1]]] (as indicated above, I'm uncertain at this point if this is indeed what a is broadcast to); how does this create the (2,2) result obtained above? (Obviously this depends in part on precisely how c is broadcast, I do recognize that much.) Finally, a seemingly relevant comment in the C source is: /* Broadcast all arrays to each other, index array at the end.*/ This would appear to confirm that co-broadcasting is performed if necessary, but what does the index array at the end phrase mean? Thanks for your continued patience and tutelage. DG On Sun, Nov 8, 2009 at 5:36 AM, josef.p...@gmail.com wrote: On Sun, Nov 8, 2009 at 5:00 AM, David Goldsmith d.l.goldsm...@gmail.com wrote: On Sun, Nov 8, 2009 at 12:57 AM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/8 David Goldsmith d.l.goldsm...@gmail.com: On Sat, Nov 7, 2009 at 11:59 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: So in essence, at least as it presently functions, the shape of 'a' *defines* what the individual choices are within 'choices`, and if 'choices' can't be parsed into an integer number of such individual choices, that's when an exception is raised? Um, I don't think so. Think of it this way: you provide np.choose with a selector array, a, and a list (not array!) [c0, c1, ..., cM] of choices. 
You construct an output array, say r, the same shape as a (no matter how many dimensions it has). Except that I haven't yet seen a working example with 'a' greater than 1-D, Josef's last two examples notwithstanding; or is that what you're saying is the bug. There's nothing magic about A being one-dimensional. C = np.random.randn(2,3,5) A = (C > -1).astype(int) + (C > 0).astype(int) + (C > 1).astype(int) R = np.choose(A, (-1, -C, C, 1)) OK, now I get it: np.choose(A[0,:,:], (-1,-C,C,-1)) and np.choose(A[0,:,0].reshape((3,1)), (-1,-C,C,1)), e.g., also work, but np.choose(A[0,:,0], (-1,-C,C,-1)) doesn't - what's necessary for choose's arguments is that both can be broadcast to a common shape (as you state below), but choose won't reshape the arguments for you to make this possible, you have to do so yourself first, if necessary. That does appear to be what's happening now; but do we want choose to be smarter than that (e.g., for np.choose(A[0,:,0], (-1,-C,C,-1)) to work, so that the user doesn't need to include the .reshape((3,1)))? No, I don't think we want to be that smart. If standard broadcasting rules apply, as I think they do, then I wouldn't want any special newaxis or reshapes done automatically. It will be confusing, the function wouldn't know what to do if there are, e.g., as many
Re: [Numpy-discussion] Use-case for np.choose
2009/11/8 David Goldsmith d.l.goldsm...@gmail.com: On Sun, Nov 8, 2009 at 7:40 PM, Anne Archibald peridot.face...@gmail.com wrote: As Josef said, this is not correct. I think the key point of confusion is this: Do not pass choose two arrays. Pass it one array and a *list* of arrays. The fact that choices can be an array is a quirk we can't change, but you should think of the second argument as a list of arrays, Fine, but as you say, one *can* pass choose an array as the second argument and it doesn't raise an exception, so if someone is stupid/careless enough to pass an array for `choices`, how is choose interpreting it as a list? Is the first dimension list converted (so that, e.g., my (2,1,2) example is interpreted as a two element list, each of whose elements is a (1,2) array)? It seems to me that this is the only reasonable interpretation, yes. After all, arrays behave like sequences along the first axis, whose elements are arrays of one less dimension. Thus if you pass an array, any broadcasting happens ignoring the first axis, which is a rather abnormal pattern for numpy broadcasting, but necessary here. As a bonus, I think this is what is implemented in current versions of numpy. (In 1.2.1 it raises an exception if broadcasting is necessary.) Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] How to get rid of the loop?
2009/11/7 Stas K stanc...@gmail.com: Thank you, Josef It is exactly what I want: ar[:,None]**2 + ar**2 Do you know something about performance of this? In my real program ar have ~ 10k elements, and expression for v more complicated (it has some trigonometric functions) The construction of ar[:,None] (which I prefer to spell ar[:,np.newaxis]) is cheap, since it just constructs a view into ar. Computing ar**2 and ar[:,None] require squaring each element of ar, but this is done in a C loop. (It's done twice in this expression, so this isn't as efficient as it might be). Only when it comes time to do the addition do you do n**2 operations. For very large n, this can be a crushing memory burden, and if you only need these values for an intermediate calculation it sometimes turns out that it's better to loop over one of the dimensions. But this expression is probably about as efficient as you can hope for, given what it does. If you need to do some more complicated calculation, the main thing to be careful of is that you do as much calculation as possible on the separate arrays, while they're still only n elements and not n**2. Anne On 07.11.2009, at 21:57, josef.p...@gmail.com wrote: On Sat, Nov 7, 2009 at 1:51 PM, Stas K stanc...@gmail.com wrote: Can I get rid of the loop in this example? And what is the fastest way to get v in the example? ar = array([1,2,3]) for a in ar: for b in ar: v = a**2+b**2 ar[:,None]**2 + ar**2 array([[ 2, 5, 10], [ 5, 8, 13], [10, 13, 18]]) I think, for this case there is also directly a function in numpy hypot which should also work. Josef ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
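For instance, a sketch with a made-up trigonometric expression (n is kept smallish here because the broadcast result has n**2 elements):

import numpy as np

ar = np.linspace(0.0, np.pi, 2000)
# do the elementwise work while the arrays still have only n elements ...
s = np.sin(ar) ** 2
c = np.cos(ar) ** 2
# ... and only then form the n-by-n combination by broadcasting
v = s[:, np.newaxis] + c          # shape (2000, 2000)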
Re: [Numpy-discussion] Use-case for np.choose
2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: Hi, all! I'm working to clarify the docstring for np.choose (Stefan pointed out to me that it is pretty unclear, and I agreed), and, now that (I'm pretty sure that) I've figured out what it does in its full generality (e.g., when the 'choices' array is greater than 2-D), I'm curious as to use-cases. (Not that I'm suggesting that we deprecate it, but I am curious as to whether or not anyone is presently using it and if so, how they're using it, esp. if anyone *is* using it with 'choices' arrays 3-D or greater.) It's a generalized np.where(), allowing more than two options: In [10]: a = np.arange(2*3*5).reshape(2,3,5) % 3 In [11]: o = np.ones((2,3,5)) In [12]: np.choose(a,(o,0.5*o,0.1*o)) Out[12]: array([[[ 1. , 0.5, 0.1, 1. , 0.5], [ 0.1, 1. , 0.5, 0.1, 1. ], [ 0.5, 0.1, 1. , 0.5, 0.1]], [[ 1. , 0.5, 0.1, 1. , 0.5], [ 0.1, 1. , 0.5, 0.1, 1. ], [ 0.5, 0.1, 1. , 0.5, 0.1]]]) Also, my experimenting suggests that the index array ('a', the first argument in the func. sig.) *must* have shape (choices.shape[-1],) - someone please let me know ASAP if this is not the case, and please furnish me w/ a counterexample because I was unable to generate one myself. It seems like a and each of the choices must have the same shape (with the exception that choices can be scalars), but I would consider this a bug. Really, a and all the choices should be broadcast to the same shape. Or maybe it doesn't make sense to broadcast a - it could be valuable to know that the result is always exactly the same shape as a - but broadcasting all the choice arrays presents an important improvement of choose over fancy indexing. There's a reason choose accepts a sequence of arrays as its second argument, rather than a higher-dimensional array. Anne Thanks, DG ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Use-case for np.choose
2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: Thanks, Anne. On Sat, Nov 7, 2009 at 1:32 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/11/7 David Goldsmith d.l.goldsm...@gmail.com: snip Also, my experimenting suggests that the index array ('a', the first argument in the func. sig.) *must* have shape (choices.shape[-1],) - someone please let me know ASAP if this is not the case, and please furnish me w/ a counterexample because I was unable to generate one myself. It seems like a and each of the choices must have the same shape So in essence, at least as it presently functions, the shape of 'a' *defines* what the individual choices are within 'choices`, and if 'choices' can't be parsed into an integer number of such individual choices, that's when an exception is raised? Um, I don't think so. Think of it this way: you provide np.choose with a selector array, a, and a list (not array!) [c0, c1, ..., cM] of choices. You construct an output array, say r, the same shape as a (no matter how many dimensions it has). The (i0, i1, ..., iN) element of the output array is obtained by looking at the (i0, i1, ..., iN) element of a, which should be an integer no larger than M; say j. Then r[i0, i1, ..., iN] = cj[i0, i1, ..., iN]. That is, each element of the selector array determines which of the choice arrays to pull the corresponding element from. For example, suppose that you are processing an array C, and have constructed a selector array A the same shape as C in which a value is 0, 1, or 2 depending on whether the C value is too small, okay, or too big respectively. Then you might do something like: C = np.choose(A, [-inf, C, inf]) This is something you might want to do no matter what shape A and C have. It's important not to require that the choices be an array of choices, because they often have quite different shapes (here, two are scalars) and it would be wasteful to broadcast them up to the same shape as C, just to stack them. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
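For reference, a small runnable version of that clipping idea, with hypothetical data and bounds; note that two of the three choices are scalars and only one is an array:

import numpy as np

C = np.array([[0.5, 3.0, 12.0],
              [7.0, -1.0, 9.5]])
lo, hi = 1.0, 10.0                 # hypothetical thresholds

A = np.zeros(C.shape, dtype=int)   # 0 = too small, 1 = okay, 2 = too big
A[C > lo] = 1
A[C > hi] = 2

C = np.choose(A, [-np.inf, C, np.inf])
# array([[-inf,  3. ,  inf],
#        [ 7. , -inf,  9.5]])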
Re: [Numpy-discussion] Random int64 and float64 numbers
2009/11/5 David Goldsmith d.l.goldsm...@gmail.com: On Thu, Nov 5, 2009 at 3:26 PM, David Warde-Farley d...@cs.toronto.edu wrote: On 5-Nov-09, at 4:54 PM, David Goldsmith wrote: Interesting thread, which leaves me wondering two things: is it documented somewhere (e.g., at the IEEE site) precisely how many *decimal* mantissae are representable using the 64-bit IEEE standard for float representation (if that makes sense); IEEE-754 says nothing about decimal representations aside from how to round when converting to and from strings. You have to provide/accept *at least* 9 decimal digits in the significand for single-precision and 17 for double-precision (section 5.6). AFAIK implementations will vary in how they handle cases where a binary significand would yield more digits than that. I was actually more interested in the opposite situation, where the decimal representation (which is what a user would most likely provide) doesn't have a finite binary expansion: what happens then, something analogous to the decimal rule of fives? If you interpret 0.1 as 1/10, then this is a very general floating-point issue: how do you round off numbers you can't represent exactly? The usual answer (leaving aside internal extended-precision shenanigans) is to round, with the rule that when you're exactly between two floating-point numbers you round to the one that's even, rather than always up or always down (the numerical analysis wizards tell us that this is more numerically stable). Anne DG ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
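Two quick illustrations of this, assuming standard IEEE double precision: 0.1 is silently rounded to the nearest representable double, and exact halfway cases round to the neighbour whose last mantissa bit is even.

import numpy as np

# 1/10 has no finite binary expansion, so the stored value is slightly off
print("%.20f" % 0.1)                        # 0.10000000000000000555...

eps = np.finfo(np.float64).eps              # spacing between 1.0 and the next double
print(1.0 + eps/2 == 1.0)                   # True: the halfway case rounds to the even neighbour below
print((1.0 + eps) + eps/2 == 1.0 + 2*eps)   # True: here the even neighbour is the one above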
Re: [Numpy-discussion] Random int64 and float64 numbers
2009/11/5 josef.p...@gmail.com: On Thu, Nov 5, 2009 at 10:42 PM, David Goldsmith d.l.goldsm...@gmail.com wrote: On Thu, Nov 5, 2009 at 3:26 PM, David Warde-Farley d...@cs.toronto.edu wrote: On 5-Nov-09, at 4:54 PM, David Goldsmith wrote: Interesting thread, which leaves me wondering two things: is it documented somewhere (e.g., at the IEEE site) precisely how many *decimal* mantissae are representable using the 64-bit IEEE standard for float representation (if that makes sense); IEEE-754 says nothing about decimal representations aside from how to round when converting to and from strings. You have to provide/accept *at least* 9 decimal digits in the significand for single-precision and 17 for double-precision (section 5.6). AFAIK implementations will vary in how they handle cases where a binary significand would yield more digits than that. I was actually more interested in the opposite situation, where the decimal representation (which is what a user would most likely provide) doesn't have a finite binary expansion: what happens then, something analogous to the decimal rule of fives? Since according to my calculations there are only about 4 * 10**17 * 308 = 123200000000000000000 double-precision floats (More straightforwardly, it's not too far below 2**64.), there are huge gaps in the floating point representation of the real line. Any user input or calculation result just gets converted to the closest float. Yes. But the huge gaps are only huge in absolute size; in fractional error they're always about the same size. They are usually only an issue if you're representing numbers in a small range with a huge offset. To take a not-so-random example, you could be representing times in days since November 17 1858, when what you care about are the microsecond-scale differences in photon arrival times. Even then you're probably okay as long as you compute directly (t[i] = i*dt/86400 + start_t) rather than having some sort of running accumulator (t[i]=t; t += dt/86400). Professor Kahan's reasoning for using doubles for most everything is that they generally have so much more precision than you actually need that you can get away with being sloppy. Anne Josef DG ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
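A sketch of the direct-versus-accumulator difference (hypothetical sampling interval and start epoch; the printed figure is just the disagreement between the two methods, not a certified error bound):

import numpy as np

n = 100000
dt = 64e-6                # hypothetical sample spacing, in seconds
start_t = 50000.0         # days since the 1858 epoch

# direct: one rounding per sample
i = np.arange(n)
t_direct = start_t + i*dt/86400.0

# running accumulator: roundoff piles up step by step
t_accum = np.empty(n)
t = start_t
for k in range(n):
    t_accum[k] = t
    t += dt/86400.0

drift_us = np.max(np.abs(t_accum - t_direct)) * 86400.0 * 1e6
print(drift_us)           # microseconds of disagreement after n steps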
Re: [Numpy-discussion] Random int64 and float64 numbers
2009/11/1 Thomas Robitaille thomas.robitai...@gmail.com: Hi, I'm trying to generate random 64-bit integer values for integers and floats using Numpy, within the entire range of valid values for that type. To generate random 32-bit floats, I can use: Others have addressed why this is giving bogus results. But it's worth thinking about what you're actually trying to do. If it's fuzz tests, that is, producing unrestricted random floats to feed to a function, then even if this worked it wouldn't produce what you want: it will never produce floats of order unity, for example. If you want do what you described, you could produce floats uniformly from -1 to 1 and then multiply the results by the largest representable float. If you want random floats, I suggest generating an array of random bytes and reinterpreting them as floats. You'll get a fair number of NaNs and infinities, so you may want to take only the finite values and regenerate the rest. This will give you some numbers from all over the range of floats, including tiny values (and denormalized values, which are a potentially important special case). Anne np.random.uniform(low=np.finfo(np.float32).min,high=np.finfo (np.float32).max,size=10) which gives for example array([ 1.47351436e+37, 9.93620693e+37, 2.22893053e+38, -3.33828977e+38, 1.08247781e+37, -8.37481260e+37, 2.64176554e+38, -2.72207226e+37, 2.54790459e+38, -2.47883866e+38]) but if I try and use this for 64-bit numbers, i.e. np.random.uniform(low=np.finfo(np.float64).min,high=np.finfo (np.float64).max,size=10) I get array([ Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf]) Similarly, for integers, I can successfully generate random 32-bit integers: np.random.random_integers(np.iinfo(np.int32).min,high=np.iinfo (np.int32).max,size=10) which gives array([-1506183689, 662982379, -1616890435, -1519456789, 1489753527, -604311122, 2034533014, 449680073, -444302414, -1924170329]) but am unsuccessful for 64-bit integers, i.e. np.random.random_integers(np.iinfo(np.int64).min,high=np.iinfo (np.int64).max,size=10) which produces the following error: OverflowError: long int too large to convert to int Is this expected behavior, or are these bugs? Thanks for any help, Thomas ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
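A sketch of the reinterpret-random-bytes idea (the function name and the retry loop are just for illustration):

import numpy as np

def random_doubles(n):
    # draw doubles spread over the whole representable range
    out = np.empty(n)
    filled = 0
    while filled < n:
        raw = np.random.bytes((n - filled) * 8)
        cand = np.frombuffer(raw, dtype=np.float64)
        cand = cand[np.isfinite(cand)]      # drop NaNs and infinities, regenerate the rest
        out[filled:filled + len(cand)] = cand
        filled += len(cand)
    return out

x = random_doubles(10)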
Re: [Numpy-discussion] Single view on multiple arrays
2009/11/1 Bill Blinn bbl...@gmail.com: What is the best way to create a view that is composed of sections of many different arrays? The short answer is, you can't. Numpy arrays must be located in contiguous blocks of memory, and the elements along any dimension must be equally spaced. A view is simply another array that references (some of) the same underlying memory, possibly with different strides. For example, imagine I had a = np.array(range(0, 12)).reshape(3, 4) b = np.array(range(12, 24)).reshape(3, 4) c = np.array(range(24, 36)).reshape(3, 4) v = multiview((3, 4)) #the idea of the following lines is that the 0th row of v is #a view on the first row of a. the same would hold true for #the 1st and 2nd row of v and the 0th rows of b and c, respectively v[0] = a[0] v[1] = b[0] v[2] = c[0] #change the underlying arrays a[0, 0] = 50 b[0, 0] = 51 c[0, 0] = 52 #I would want all these assertions to pass because the view #refers to the rows of a, b and c assert v[0, 0] == 50 assert v[1, 0] == 51 assert v[2, 0] == 52 Is there any way to do this? If you need to be able to do this, you're going to have to rearrange your code somewhat, so that your original arrays are views of parts of an initial array. It's worth noting that if what you're worried about is the time it takes to copy data, you might well be surprised at how cheap data copying and memory allocation really are. Given that you are working from Python, only for really enormous arrays will copying data be expensive, and allocating memory is really a very cheap operation (modern malloc()s average something like a few cycles). What's more, since modern CPUs are so heavily cache-bound, using strided views can be quite slow, since you end up loading whole 64-byte cache lines for each 8-byte double you need. In short, you probably need to rethink your design, but while you're doing it, don't worry about copying data. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
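A sketch of that rearrangement for the example above: allocate one backing array first, then let a, b, c (and v) all be views into it. Shapes follow the original post.

import numpy as np

backing = np.empty((3, 3, 4), dtype=int)
a, b, c = backing[0], backing[1], backing[2]    # views, no copying

a[:] = np.arange(0, 12).reshape(3, 4)
b[:] = np.arange(12, 24).reshape(3, 4)
c[:] = np.arange(24, 36).reshape(3, 4)

v = backing[:, 0, :]     # row 0 of each of a, b and c, also a view

a[0, 0] = 50
b[0, 0] = 51
c[0, 0] = 52
assert v[0, 0] == 50 and v[1, 0] == 51 and v[2, 0] == 52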
Re: [Numpy-discussion] Designing a new storage format for numpy recarrays
2009/10/30 Stephen Simmons m...@stevesimmons.com: I should clarify what I meant.. Suppose I have a recarray with 50 fields and want to read just one of those fields. PyTables/HDF will read in the compressed data for chunks of complete rows, decompress the full 50 fields, and then give me back the data for just one field. I'm after a solution where asking for a single field reads in the bytes for just that field from disk and decompresses it. This is similar to the difference between databases storing their data as rows or columns. See for example Mike Stonebraker's C-store column-oriented database (http://db.lcs.mit.edu/projects/cstore/vldb.pdf). Is there any reason not to simply store the data as a collection of separate arrays, one per column? It shouldn't be too hard to write a wrapper to give this nicer syntax, while implementing it under the hood with HDF5... Anne Stephen Francesc Alted wrote: A Friday 30 October 2009 14:18:05 Stephen Simmons escrigué: - Pytables (HDF using chunked storage for recarrays with LZO compression and shuffle filter) - can't extract individual field from a recarray Er... Have you tried the ``cols`` accessor? http://www.pytables.org/docs/manual/ch04.html#ColsClassDescr Cheers, ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
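A rough sketch of that column-per-dataset layout using h5py (field names and dtypes are hypothetical); each field becomes its own chunked, compressed dataset, so reading one field only touches that field's bytes:

import numpy as np
import h5py

recs = np.zeros(1000, dtype=[('time', 'f8'), ('flux', 'f4'), ('flag', 'i1')])

f = h5py.File('columns.h5', 'w')
for name in recs.dtype.names:
    f.create_dataset(name, data=recs[name], compression='gzip', chunks=True)
f.close()

# later, pull back a single column without decompressing the others
f = h5py.File('columns.h5', 'r')
flux = f['flux'][:]
f.close()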
Re: [Numpy-discussion] Another suggestion for making numpy's functions generic
2009/10/20 Sebastian Walter sebastian.wal...@gmail.com: On Tue, Oct 20, 2009 at 5:45 AM, Anne Archibald peridot.face...@gmail.com wrote: 2009/10/19 Sebastian Walter sebastian.wal...@gmail.com: I'm all for generic (u)funcs since they might come handy for me since I'm doing lots of operation on arrays of polynomials. Just as a side note, if you don't mind my asking, what sorts of operations do you do on arrays of polynomials? In a thread on scipy-dev we're discussing improving scipy's polynomial support, and we'd be happy to get some more feedback on what they need to be able to do. I've been reading (and commenting) that thread ;) I'm doing algorithmic differentiation by computing on truncated Taylor polynomials in the Powerbasis, i.e. always truncating all operation at degree D z(t) = \sum_d=0^{D-1} z_d t^d = x(t) * y(t) = \sum_{d=0}^{D-1} \sum_{k=0}^d x_k * y_{d-k} + O(t^D) Using other bases does not make sense in my case since the truncation of all terms of higher degree than t^D has afaik no good counterpart for bases like chebycheff. On the other hand, I need to be generic in the coefficients, e.g. z_d from above could be a tensor of any shape, e.g. a matrix. In fact, truncating at degree D for Chebyshev polynomials works exactly the same way as it does for power polynomials, and if what you care about is function approximation, it has much nicer behaviour. But if what you care about is really truncated Taylor polynomials, there's no beating the power basis. I realize that arrays of polynomial objects are handy from a bookkeeping point of view, but how does an array of polynomials, say of shape (N,), differ from a single polynomial with coefficients of shape (N,)? I think we need to provide the latter in our polynomial classes, but as you point out, getting ufunc support for the former is nontrivial. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
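For scalar coefficients, the truncated product Sebastian describes is just a truncated convolution; a minimal sketch in the power basis (the coefficients are made up):

import numpy as np

D = 5
x = np.array([1.0, 2.0, 0.5, 0.0, 3.0])   # x_0 .. x_{D-1}
y = np.array([2.0, 1.0, 0.0, 4.0, 1.0])   # y_0 .. y_{D-1}

# z_d = sum_{k=0}^{d} x_k * y_{d-k}, discarding every term of degree >= D
z = np.convolve(x, y)[:D]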
Re: [Numpy-discussion] Optimized sum of squares
2009/10/20 josef.p...@gmail.com: On Sun, Oct 18, 2009 at 6:06 AM, Gary Ruben gru...@bigpond.net.au wrote: Hi Gaël, If you've got a 1D array/vector called a, I think the normal idiom is np.dot(a,a) For the more general case, I think np.tensordot(a, a, axes=something_else) should do it, where you should be able to figure out something_else for your particular case. Is it really possible to get the same as np.sum(a*a, axis) with tensordot if a.ndim=2 ? Any way I try the something_else, I get extra terms as in np.dot(a.T, a) It seems like this would be a good place to apply numpy's higher-dimensional ufuncs: what you want seems to just be the vector inner product, broadcast over all other dimensions. In fact I believe this is implemented in numpy as a demo: numpy.umath_tests.inner1d should do the job. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Optimized sum of squares
2009/10/20 josef.p...@gmail.com: On Tue, Oct 20, 2009 at 3:09 PM, Anne Archibald peridot.face...@gmail.com wrote: 2009/10/20 josef.p...@gmail.com: On Sun, Oct 18, 2009 at 6:06 AM, Gary Ruben gru...@bigpond.net.au wrote: Hi Gaël, If you've got a 1D array/vector called a, I think the normal idiom is np.dot(a,a) For the more general case, I think np.tensordot(a, a, axes=something_else) should do it, where you should be able to figure out something_else for your particular case. Is it really possible to get the same as np.sum(a*a, axis) with tensordot if a.ndim=2 ? Any way I try the something_else, I get extra terms as in np.dot(a.T, a) It seems like this would be a good place to apply numpy's higher-dimensional ufuncs: what you want seems to just be the vector inner product, broadcast over all other dimensions. In fact I believe this is implemented in numpy as a demo: numpy.umath_tests.inner1d should do the job. Thanks, this works well, needs core in name (I might have to learn how to swap or roll axis to use this for more than 2d.) np.core.umath_tests.inner1d(a.T, b.T) array([12, 8, 16]) (a*b).sum(0) array([12, 8, 16]) np.core.umath_tests.inner1d(a.T, b.T) array([12, 8, 16]) (a*a).sum(0) array([126, 166, 214]) np.core.umath_tests.inner1d(a.T, a.T) array([126, 166, 214]) What's the status on these functions? They don't show up in the docs or help, except for a brief mention in the c-api: http://docs.scipy.org/numpy/docs/numpy-docs/reference/c-api.generalized-ufuncs.rst/ Are they for public consumption and should go into the docs? Or do they remain a hidden secret, to force users to read the mailing lists? I think the long-term goal is to have a completely ufuncized linear algebra library, and I think these functions are just tests of the gufunc features. In principle, at least, it wouldn't actually be too hard to fill out a full linear algebra library, since the per element linear algebra operations already exist. Unfortunately the code should exist for many data types, and the code generator scheme currently used to do this for ordinary ufuncs is a barrier to contributions. It might be worth banging out a doubles-only generic ufunc linear algebra library (in addition to numpy.linalg/scipy.linalg), just as a proof of concept. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Another suggestion for making numpy's functions generic
2009/10/19 Sebastian Walter sebastian.wal...@gmail.com: I'm all for generic (u)funcs since they might come handy for me since I'm doing lots of operation on arrays of polynomials. Just as a side note, if you don't mind my asking, what sorts of operations do you do on arrays of polynomials? In a thread on scipy-dev we're discussing improving scipy's polynomial support, and we'd be happy to get some more feedback on what they need to be able to do. Thanks! Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] double-precision sqrt?
2009/10/17 Adam Ginsburg adam.ginsb...@colorado.edu: My code is actually wrong, but I still have the problem I've identified: sqrt is leading to precision errors. Sorry about the earlier mistake. I think you'll find that numpy's sqrt is as good as it gets for double precision. You can try using numpy's float96 type, which, at least on my machine, does give a few more significant figures. If you really, really need accuracy, there are arbitrary-precision packages for python, which you could try. But I think you may find that your problem is not solved by higher precision. Something about ray-tracing just leads it to ferret out every little limitation of floating-point computation. For example, you can easily get surface acne when shooting shadow rays, where a ray shot from a surface to a light source accidentally intersects that same surface for some pixels but not for others. You can try to fix it by requiring some minimum intersection distance, but then you'll find lots of weird little quirks where your minimum distance causes problems. A better solution is one which takes into account the awkwardness of floating-point; for this particular case one trick is to mark the object you're shooting rays from as not a candidate for intersection. (This doesn't work, of course, if the object can cast shadows on itself...) I have even seen people advocate for using interval analysis inside ray tracers, to avoid this kind of problem. Anne ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
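A quick way to check whether extended precision buys anything on a given machine (np.longdouble is float96 or float128 depending on the platform, and on some platforms it is no wider than a double):

import numpy as np

x = np.float64(2.0)
xl = np.longdouble(2.0)

print(repr(np.sqrt(x)))    # about 16 significant digits
print(repr(np.sqrt(xl)))   # a few more, if the platform's long double is wider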
Re: [Numpy-discussion] loading data
2009/6/25 Mag Gam magaw...@gmail.com: Hello. I am very new to NumPy and Python. We are doing some research in our Physics lab and we need to store massive amounts of data (100GB daily). I therefore, am going to use hdf5 and h5py. The problem is I am using np.loadtxt() to create my array and create a dataset according to that. np.loadtxt() is reading a file which is about 50GB. This takes a very long time! I was wondering if there was a much easier and better way of doing this. If you are stuck with the text array, you probably can't beat numpy.loadtxt(); reading a 50 GB text file is going to be slow no matter how you cut it. So I would take a look at the code that generates the text file, and see if there's any way you can make it generate a format that is faster to read. (I assume the code is in C or FORTRAN and you'd rather not mess with it more than necessary). Of course, generating hdf5 directly is probably fastest; you might look at the C and FORTRAN hdf5 libraries and see how hard it would be to integrate them into the code that currently generates a text file. Even if you need to have a python script to gather the data and add metadata, hdf5 will be much much more efficient than text files as an intermediate format. If integrating HDF5 into the generating application is too difficult, you can try simply generating a binary format. Using numpy's structured data types, it is possible to read in binary files extremely efficiently. If you're using the same architecture to generate the files as read them, you can just write out raw binary arrays of floats or doubles and then read them into numpy. I think FORTRAN also has a semi-standard padded binary format which isn't too difficult to read either. You could even use numpy's native file format, which for a single array should be pretty straightforward, and should yield portable results. If you really can't modify the code that generates the text files, your code is going to be slow. But you might be able to make it slightly less slow. If, for example, the text files are a very specific format, especially if they're made up of columns of fixed width, it would be possible to write compiled code to read them slightly more quickly. (The very easiest way to do this is to write a little C program that reads the text files and writes out a slightly friendlier format, as above.) But you may well find that simply reading a 50 GB file dominates your run time, which would mean that you're stuck with slowness. In short: avoid text files if at all possible. Good luck, Anne TIA ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
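A sketch of the raw-binary route with a hypothetical record layout; any C or Fortran writer that emits the same packed bytes will do, and the read back is a single call:

import numpy as np

# hypothetical layout: (int32 channel, float64 time, float64 value), little-endian
rec = np.dtype([('channel', '<i4'), ('time', '<f8'), ('value', '<f8')])

data = np.zeros(1000, dtype=rec)     # stand-in for the real measurements
data.tofile('run001.bin')            # raw packed records, no text formatting

loaded = np.fromfile('run001.bin', dtype=rec)
times = loaded['time']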
Re: [Numpy-discussion] Interleaved Arrays and
I'm not sure it's worth having a function to replace a one-liner (column_stack followed by reshape). But if you're going to implement this with slice assignment, you should take advantage of the flexibility this method allows and offer the possibility of interleaving raggedly, that is, where the size of the arrays drops at some point, so that you could interleave arrays of size 4, 4, and 3 to get one array of size 11. This allows split and join operations, for example for multiprocessing. On the other hand you should also include a documentation warning that this can be slow when interleaving large numbers of small arrays. Anne 2009/6/16 Neil Martinsen-Burrell n...@wartburg.edu: On 2009-06-16 16:05 , Robert wrote: Neil Martinsen-Burrell wrote: On 06/16/2009 02:18 PM, Robert wrote: n = 10 xx = np.ones(n) yy = np.arange(n) aa = np.column_stack((xx,yy)) bb = np.column_stack((xx+1,yy)) aa array([[ 1., 0.], [ 1., 1.], [ 1., 2.], [ 1., 3.], [ 1., 4.], [ 1., 5.], [ 1., 6.], [ 1., 7.], [ 1., 8.], [ 1., 9.]]) bb array([[ 2., 0.], [ 2., 1.], [ 2., 2.], [ 2., 3.], [ 2., 4.], [ 2., 5.], [ 2., 6.], [ 2., 7.], [ 2., 8.], [ 2., 9.]]) np.column_stack((aa,bb)) array([[ 1., 0., 2., 0.], [ 1., 1., 2., 1.], [ 1., 2., 2., 2.], [ 1., 3., 2., 3.], [ 1., 4., 2., 4.], [ 1., 5., 2., 5.], [ 1., 6., 2., 6.], [ 1., 7., 2., 7.], [ 1., 8., 2., 8.], [ 1., 9., 2., 9.]]) cc = _ cc.reshape((n*2,2)) array([[ 1., 0.], [ 2., 0.], [ 1., 1.], [ 2., 1.], [ 1., 2.], [ 2., 2.], [ 1., 3.], [ 2., 3.], [ 1., 4.], [ 2., 4.], [ 1., 5.], [ 2., 5.], [ 1., 6.], [ 2., 6.], [ 1., 7.], [ 2., 7.], [ 1., 8.], [ 2., 8.], [ 1., 9.], [ 2., 9.]]) However I feel too, there is a intuitive abbrev function like 'interleave' or so missing in numpy shape_base or so. Using fancy indexing, you can set strided portions of an array equal to another array. So:: In [2]: aa = np.empty((10,2)) In [3]: aa[:, 0] = 1 In [4]: aa[:,1] = np.arange(10) In [5]: bb = np.empty((10,2)) In [6]: bb[:,0] = 2 In [7]: bb[:,1] = aa[:,1] # this works In [8]: cc = np.empty((20,2)) In [9]: cc[::2,:] = aa In [10]: cc[1::2,:] = bb In [11]: cc Out[11]: array([[ 1., 0.], [ 2., 0.], [ 1., 1.], [ 2., 1.], [ 1., 2.], [ 2., 2.], [ 1., 3.], [ 2., 3.], [ 1., 4.], [ 2., 4.], [ 1., 5.], [ 2., 5.], [ 1., 6.], [ 2., 6.], [ 1., 7.], [ 2., 7.], [ 1., 8.], [ 2., 8.], [ 1., 9.], [ 2., 9.]]) Using this syntax, interleave could be a one-liner. -Neil that method of 'filling an empty with a pattern' was mentioned in the other (general) interleaving question. It requires however a lot of particular numbers and :'s in the code, and requires even more statements which can hardly be written in functional style - in one line?. The other approach is more jount, free of fancy indexing assignments. jount? I think that assigning to a strided index is very clear, but that is a difference of opinion. All of the calls to np.empty are the equivalent of the column_stack's in your example. I think that operations on segments of arrays are fundamental to an array-processing language such as NumPy. Using ; you can put as many of those statements as you would like one line. :) The general interleaving should work efficiently in one like this: np.column_stack/concatenate((r,g,b,), axis=...).reshape(..) But as all this is not intuitive, something like this should be in numpy perhaps? : def interleave( tup_arrays, axis = None ) Here is a minimally tested implementation. If anyone really wants this for numpy, I'll gladly add comments and tests. 
I couldn't figure out how to automatically find the greatest dtype, so I added an argument to specify it; otherwise it uses the type of the first array.

def interleave(arrays, axis=0, dtype=None):
    assert len(arrays) > 0
    first = arrays[0]
    assert all([arr.shape == first.shape for arr in arrays])
    new_shape = list(first.shape)
    new_shape[axis] *= len(arrays)
    if dtype is None:
        new_dtype = first.dtype
    else:
        new_dtype = dtype
    interleaved = np.empty(new_shape, new_dtype)
    axis_slice = [slice(None, None, None)]*axis + \
                 [slice(None)] + \
                 [slice(None, None, None)]*(first.ndim - axis - 1)
    for i, arr in enumerate(arrays):
        # every len(arrays)-th position along the chosen axis, starting at offset i
        axis_slice[axis] = slice(i, None, len(arrays))
        interleaved[tuple(axis_slice)] = arr
    return interleaved
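A quick check of the function above against the column_stack/reshape one-liner mentioned at the top of the thread (1-d arrays of equal length assumed):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

via_function = interleave([a, b])                   # the function defined above
via_oneliner = np.column_stack((a, b)).reshape(-1)  # array([ 1, 10,  2, 20,  3, 30,  4, 40])
assert (via_function == via_oneliner).all()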
Re: [Numpy-discussion] How to remove fortran-like loops with numpy?
2009/6/8 Robert Kern robert.k...@gmail.com: On Mon, Jun 8, 2009 at 17:04, David Goldsmith d_l_goldsm...@yahoo.com wrote: I look forward to an instructive reply: the Pythonic way to do it would be to take advantage of the facts that Numpy is pre-vectorized and uses broadcasting, but so far I haven't been able to figure out (though I haven't yet really buckled down and tried real hard) how to broadcast a conditionally-terminated iteration where the number of iterations will vary among the array elements. Hopefully someone else already has. :-) You can't, really. What you can do is just keep iterating with the whole data set and ignore the parts that have already converged. Here is an example: Well, yes and no. This is only worth doing if the number of problem points that require many iterations is small - not the case here without some sort of periodicity detection - but you can keep an array of not-yet-converged points, which you iterate. When some converge, you store them in a results array (with fancy indexing) and remove them from your still-converging array. It's also worth remembering that the overhead of for loops is large but not enormous, so you can often remove only the inner for loop, in this case perhaps iterating over the image a line at a time. Anne

z = np.zeros((201,201), dtype=complex)
Y, X = np.mgrid[1:-1:-201j, -1.5:0.5:201j]
c = np.empty_like(z)
c.real = X
c.imag = Y
N = np.zeros(z.shape, dtype=int)
while ((N < 30) | (abs(z) > 2)).all():
    N += abs(z) < 2
    z = z ** 2 + c
N[N >= 30] = 0

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
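A sketch of the keep-only-the-unconverged-points approach Anne describes, on the same Mandelbrot-style example (the bookkeeping details are one choice among several):

import numpy as np

Y, X = np.mgrid[1:-1:-201j, -1.5:0.5:201j]
c = (X + 1j*Y).ravel()

z = np.zeros_like(c)
counts = np.zeros(c.shape, dtype=int)    # escape iteration; 0 means never escaped
active = np.arange(c.size)               # indices of points still being iterated

for it in range(1, 31):
    z[active] = z[active]**2 + c[active]
    escaped = np.abs(z[active]) > 2
    counts[active[escaped]] = it         # record when each point escaped
    active = active[~escaped]            # drop converged points from the working set

counts = counts.reshape(X.shape)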
Re: [Numpy-discussion] extract elements of an array that are contained in another array?
2009/6/4 josef.p...@gmail.com: intersect1d should throw a domain error if you give it arrays with non-unique elements, which is not done for speed reasons It seems to me that this is the basic source of the problem. Perhaps this can be addressed? I realize maintaining compatibility with the current behaviour is necessary, so how about a multistage deprecation: 1. add a keyword argument to intersect1d assume_unique; if it is not present, check for uniqueness and emit a warning if not unique 2. change the warning to an exception Optionally: 3. change the meaning of the function to that of intersect1d_nu if the keyword argument is not present One could do something similar with setmember1d. This would remove the pitfall of the 1d assumption and the wart of the _nu names without hampering performance for people who know they have unique arrays and are in a hurry. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] performance matrix multiplication vs. matlab
2009/6/4 David Paul Reichert d.p.reich...@sms.ed.ac.uk: Hi all, I would be glad if someone could help me with the following issue: From what I've read on the web it appears to me that numpy should be about as fast as matlab. However, when I do simple matrix multiplication, it consistently appears to be about 5 times slower. I tested this using A = 0.9 * numpy.matlib.ones((500,100)) B = 0.8 * numpy.matlib.ones((500,100)) def test(): for i in range(1000): A*B.T I also used ten times larger matrices with ten times less iterations, used xrange instead of range, arrays instead of matrices, and tested it on two different machines, and the result always seems to be the same. Any idea what could go wrong? I'm using ipython and matlab R2008b. Apart from the implementation issues people have chimed in about already, it's worth noting that the speed of matrix multiplication depends on the memory layout of the matrices. So generating B instead directly as a 100 by 500 matrix might affect the speed substantially (I'm not sure in which direction). If MATLAB's matrices have a different memory order, that might be a factor as well. Anne Thanks, David -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
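One way to see the layout effect on a given machine (plain arrays and dot rather than matrix objects; the timings are only illustrative and machine-dependent):

import numpy as np
import timeit

A = 0.9 * np.ones((500, 100))
B = 0.8 * np.ones((500, 100))
Bt = np.ascontiguousarray(B.T)    # same values as B.T, but contiguous in memory

print(timeit.timeit(lambda: np.dot(A, B.T), number=200))
print(timeit.timeit(lambda: np.dot(A, Bt), number=200))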
Re: [Numpy-discussion] Vectorizing array updates
2009/4/29 Dan Goodman dg.gm...@thesamovar.net: Robert Kern wrote: On Wed, Apr 29, 2009 at 16:19, Dan Goodman dg.gm...@thesamovar.net wrote: Robert Kern wrote: On Wed, Apr 29, 2009 at 08:03, Daniel Yarlett daniel.yarl...@gmail.com wrote: As you can see, Current is different in the two cases. Any ideas how I can recreate the behavior of the iterative process in a more numpy- friendly, vectorized (and hopefully quicker) way? Use bincount(). Neat. Is there a memory efficient way of doing it if the indices are very large but there aren't many of them? e.g. if the indices were I=[1, 2] then bincount would create a gigantic array of 2 elements for just two addition operations! indices -= indices.min() Ah OK, but bincount is still going to make an array of 1 elements in that case... I came up with this trick, but I wonder whether it's overkill: Supposing you want to do y+=x[I] where x is big and indices in I are large but sparse (although potentially including repeats). J=unique(I) K=digitize(I,J)-1 b=bincount(K) y+=b*x[J] Well it seems to work, but surely there must be a nicer way? Well, a few lines serve to rearrange the indices array so it contains each number only once, and then to add together (with bincount) all values with the same index: ix = np.unique1d(indices) reverse_index = np.searchsorted(ix, indices) r = np.bincount(reverse_index, weights=values) You can then do: y[ix] += r All this could be done away with if bincount supported input and output arguments. Then y[indices] += values could become np.bincount(indices, weights=values, input=y, output=y) implemented y the same nice tight loop one would use in C. (I'm not sure bincount is the right name for such a general function, though.) This question comes up fairly often on the mailing list, so it would be nice to have a single function to point to. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
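A self-contained run of that recipe (using np.unique, which in later numpy releases covers what unique1d did above; the values are made up):

import numpy as np

y = np.zeros(10)
indices = np.array([2, 7, 2, 2, 9, 7])            # sparse, possibly repeated targets
values = np.array([1.0, 0.5, 1.0, 1.0, 2.0, 0.5])

ix = np.unique(indices)                           # [2, 7, 9]
reverse_index = np.searchsorted(ix, indices)      # position of each index within ix
r = np.bincount(reverse_index, weights=values)    # per-target sums: [3.0, 1.0, 2.0]
y[ix] += r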
Re: [Numpy-discussion] Another Array
2009/4/10 Ian Mallett geometr...@gmail.com: The vectors are used to jitter each particle's initial speed, so that the particles go in different directions instead of moving all as one. Using the unit vector causes the particles to make the smooth parabolic shape. The jitter vectors must then be of a random length, so that the particles go in all different directions at all different speeds, instead of just all in different directions. Why not just skip the normalization? Then you'll get vectors with random direction and a natural distribution of lengths. And it'll be faster than the unit vectors... Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using reducing functions without eliminating dimensions?
2009/4/9 Charles R Harris charlesr.har...@gmail.com: On Tue, Apr 7, 2009 at 12:44 PM, Dan Lenski dlen...@gmail.com wrote: Hi all, I often want to use some kind of dimension-reducing function (like min(), max(), sum(), mean()) on an array without actually removing the last dimension, so that I can then do operations broadcasting the reduced array back to the size of the full array. Full example: table.shape (47, 1814) table.min(axis=1).shape (47,) table - table.min(axis=1) ValueError: shape mismatch: objects cannot be broadcast to a single shape table - table.min(axis=1)[:, newaxis] I have to resort to ugly code with lots of stuff like ... axis=1)[:, newaxis]. Is there any way to get the reducing functions to leave a size-1 dummy dimension in place, to make this easier? Not at the moment. There was talk a while back of adding a keyword for this option, it would certainly make things easier for some common uses. It might be worth starting that conversation up again. What's wrong with np.amin(a,axis=-1)[...,np.newaxis]? Anne Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Optical autocorrelation calculated with numpy is slow
2009/3/30 João Luís Silva jsi...@fc.up.pt: Hi, I wrote a script to calculate the *optical* autocorrelation of an electric field. It's like the autocorrelation, but sums the fields instead of multiplying them. I'm calculating I(tau) = integral( abs(E(t)+E(t-tau))**2,t=-inf..inf) You may be in trouble if there's cancellation, but can't you just rewrite this as abs(E(t))**2 + abs(E(t-tau))**2 + 2*Re(E(t)*conj(E(t-tau)))? Then you have two O(n) integrals and one standard autocorrelation... Anne with script appended at the end. It's too slow for my purposes (takes ~5 seconds, and scales ~O(N**2)). numpy's correlate is fast enough, but isn't what I need as it multiplies instead of adding the fields. Could you help me get this script to run faster (without having to write it in another programming language) ? Thanks, João Silva # import numpy as np #import matplotlib.pyplot as plt n = 2**12 n_autocorr = 3*n-2 c = 3E2 w0 = 2.0*np.pi*c/800.0 t_max = 100.0 t = np.linspace(-t_max/2.0,t_max/2.0,n) E = np.exp(-(t/10.0)**2)*np.exp(1j*w0*t) #Electric field dt = t[1]-t[0] t_autocorr=np.linspace(-dt*n_autocorr/2.0,dt*n_autocorr/2.0,n_autocorr) E1 = np.zeros(n_autocorr,dtype=E.dtype) E2 = np.zeros(n_autocorr,dtype=E.dtype) Ac = np.zeros(n_autocorr,dtype=np.float64) E2[n-1:n-1+n] = E[:] for i in range(2*n-2): E1[:] = 0.0 E1[i:i+n] = E[:] Ac[i] = np.sum(np.abs(E1+E2)**2) Ac *= dt #plt.plot(t_autocorr,Ac) #plt.show() # ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
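A sketch of that rewrite applied to the posted script (same field definition; because the field is entirely contained in the padded window, the two |E|**2 integrals are the same constant for every lag, and the cross term is an ordinary autocorrelation done here with FFTs, making the whole thing O(n log n); the lag ordering below differs slightly from the loop indexing in the original):

import numpy as np

n = 2**12
c = 3E2
w0 = 2.0*np.pi*c/800.0
t_max = 100.0
t = np.linspace(-t_max/2.0, t_max/2.0, n)
E = np.exp(-(t/10.0)**2)*np.exp(1j*w0*t)
dt = t[1] - t[0]

# |E(t)+E(t-tau)|**2 = |E(t)|**2 + |E(t-tau)|**2 + 2*Re(E(t)*conj(E(t-tau)))
const = 2.0*np.sum(np.abs(E)**2)

m = 2*n                                   # zero-pad so the correlation does not wrap around
F = np.fft.fft(E, m)
r = np.fft.ifft(F*np.conj(F))             # r[k] = sum_t E(t) conj(E(t-k)), circularly
r = np.concatenate((r[-(n-1):], r[:n]))   # reorder to lags -(n-1) .. n-1

Ac = (const + 2.0*r.real)*dt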
Re: [Numpy-discussion] array of matrices
2009/3/28 Geoffrey Irving irv...@naml.us: On Sat, Mar 28, 2009 at 12:47 AM, Robert Kern robert.k...@gmail.com wrote: 2009/3/27 Charles R Harris charlesr.har...@gmail.com: On Fri, Mar 27, 2009 at 4:43 PM, Robert Kern robert.k...@gmail.com wrote: On Fri, Mar 27, 2009 at 17:38, Bryan Cole br...@cole.uklinux.net wrote: I have a number of arrays of shape (N,4,4). I need to perform a vectorised matrix-multiplication between pairs of them I.e. matrix-multiplication rules for the last two dimensions, usual element-wise rule for the 1st dimension (of length N). (How) is this possible with numpy? dot(a,b) was specifically designed for this use case. I think maybe he wants to treat them as stacked matrices. Oh, right. Sorry. dot(a, b) works when a is (N, 4, 4) and b is just (4, 4). Never mind. It'd be great if this operation existed as a primitive. What do you think would be the best way in which to add it? One option would be to add a keyword argument to dot giving a set of axes to map over. E.g., dot(a, b, map=0) = array([dot(u,v) for u,v in zip(a,b)]) # but in C map isn't a very good name for the argument, though. I think the right long-term solution is to make dot (and some other linear algebra functions) into generalized ufuncs, so that when you dot two multidimensional objects together, they are treated as arrays of two-dimensional arrays, broadcasting is done on all but the last two dimensions, and then the linear algebra is applied elementwise. This covers basically all stacked matrices uses in a very general way, but would require some redesigning of the linear algebra system - for example, dot() currently works on both two- and one-dimensional arrays, which can't work in such a setting. The infrastructure to support such generalized ufuncs has been added to numpy, but as far as I know no functions yet make use of it. Anne Geoffrey ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Cython numerical syntax revisited
2009/3/5 Francesc Alted fal...@pytables.org: A Thursday 05 March 2009, Francesc Alted escrigué: Well, I suppose that, provided that Cython could perform the for-loop transformation, giving support for strided arrays would be relatively trivial, and the performance would be similar than numexpr in this case. Mmh, perhaps not so trivial, because that implies that the stride of an array should be known in compilation time, and that would require a new qualifier when declaring the array. Tricky... Not necessarily. You can transform a[1,2,3] into *(a.data + 1*a.strides[0] + 2*a.strides[1] + 3*a.strides[2]) without any need for static information beyond that a is 3-dimensional. This would already be valuable, though perhaps you'd want to be able to declare that a particular dimension had stride 1 to simplify things. You could then use this same implementation to add automatic iteration. Anne Cheers, -- Francesc Alted ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Interpolation via Fourier transform
2009/3/5 M Trumpis mtrum...@berkeley.edu: Hi Nadav.. if you want a lower resolution 2d function with the same field of view (or whatever term is appropriate to your case), then in principle you can truncate your higher frequencies and do this: sig = ifft2_func(sig[N/2 - M/2:N/2 + M/2, N/2 - M/2:N/2+M/2]) I like to use an fft that transforms from an array indexing negative-to-positive freqs to an array that indexes negative-to-positive spatial points, so in both spaces, the origin is at (N/2,N/2). Then the expression works as-is. The problem is if you've got different indexing in one or both spaces (typically positive frequencies followed by negative) you can play around with a change of variables in your DFT in one or both spaces. If the DFT is defined as a computing frequencies from 0,N, then putting in n' = n-N/2 leads to a term like exp(1j*pi*q) that multiplies f[q]. Here's a toy example: a = np.cos(2*np.pi*5*np.arange(64)/64.) P.plot(np.fft.fft(a).real) P.plot(np.fft.fft(np.power(-1,np.arange(64))*a).real) The second one is centered about index N/2 Similarly, if you need to change the limits of the summation of the DFT from 0,N to -N/2,N/2, then you can multiply exp(1j*pi*n) to the outside of the summation. Like I said, easy enough in principle! There's also the hit-it-with-a-hammer approach: Just downsample in x then in y, using the one-dimensional transforms. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
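A sketch of the truncation recipe with fftshift doing the recentering (sizes are hypothetical and the output scaling is a choice, since it depends on which FFT normalization you want):

import numpy as np

N, M = 64, 32                      # original and reduced grid sizes (hypothetical)
sig = np.random.randn(N, N)

F = np.fft.fftshift(np.fft.fft2(sig))      # put the origin at (N/2, N/2)
Fc = F[N//2 - M//2:N//2 + M//2,
       N//2 - M//2:N//2 + M//2]            # keep only the low frequencies
small = np.fft.ifft2(np.fft.ifftshift(Fc))*float(M*M)/(N*N)
small = small.real                         # sig was real; keep the real part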
Re: [Numpy-discussion] Floating point question
On 02/03/2009, Gideon Simpson simp...@math.toronto.edu wrote: I recently discovered that for 8 byte floating point numbers, my fortran compilers (gfortran 4.2 and ifort 11.0) on an OS X core 2 duo machine believe the smallest number is 2.220507...E-308. I presume that my C compilers have similar results. I then discovered that the smallest floating point number in python 2.5 is 4.9065...E-324. I have been using numpy to generate data, saving it with savetxt, and then reading it in as ASCII into my fortran code. Recently, it crapped out on something because it didn't like reading in a number that small, though it is apparently perfectly acceptable to python. My two questions are: 1. What is the best way to handle this? Is it just to add a filter of the form u = u * ( np.abs(u) > 2.3e-308 ) 2. What gives? What's the origin of this (perceived) inconsistency in floating points across languages within the same platform? I recognize that this isn't specific to Scipy/Numpy, but thought someone here might have the answer. What's happening is that numbers like 1e-310 are represented by denormals. If we use base 10 to explain, suppose that floating point numbers could only have five digits of mantissa and two digits of exponent: 1.3000 e 00 2.1000 e-35 1.0000 e-99 Now what to do if you divide that last number in half? You can write it as: 0.5000 e-99 But this is a bit of an anomaly: unlike all normal floating-point numbers, it has a leading zero, and there are only four digits of information in the mantissa. In binary it's even more of an anomaly, since in binary you can take advantage of the fact that all normal mantissas start with one by not bothering to store the one. But it turns out that implementing denormals is useful to provide graceful degradation as numbers underflow, so it's in IEEE. It appears that some FORTRAN implementations cannot handle denormals, which is giving you trouble. It's usually fairly safe to simply replace all denormals by zero. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
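To see the boundary in numpy, and the flush-to-zero filter from the question written out (tiny is the smallest *normal* double, the 2.2e-308 figure the Fortran compilers report):

import numpy as np

tiny = np.finfo(np.float64).tiny    # 2.2250738585072014e-308, smallest normal double
x = np.float64(1e-310)              # perfectly valid in numpy, but stored as a denormal

u = np.array([1.0, 1e-310, -3e-320, 2.5])
u[np.abs(u) < tiny] = 0.0           # flush denormals to zero before handing data to Fortran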
Re: [Numpy-discussion] Concatenating Arrays to make Views
2008/12/15 Benjamin Haynor bhay...@hotmail.com: I was wondering if I can concatenate 3 arrays, where the result will be a view of the original three arrays, instead of a copy of the data. For example, suppose I write the following import numpy as n a = n.array([[1,2],[3,4]]) b = n.array([[5,6],[7,8]]) c = n.array([[9,10],[11,12]]) c = n.r_[a,b] Now c = : [[1,2], [3,4], [5,6], [7,8], [9,10], [11,12]] I was hoping to get an array, such that, when I change d, a, b, and c will also change appropriately. Any ideas? An array must be a contiguous piece of memory, so this is impossible unless you allocate d first and make a b and c views of it. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Shape (z,0)
2008/11/28 T J [EMAIL PROTECTED]: import numpy as np x = np.ones((3,0)) x array([], shape(3,0), dtype=float64) To preempt, I'm not really concerned with the answer to: Why would anyone want to do this? I just want to know what is happening. Especially, with x[0,:] = 5 (which works). It seems that nothing is really happening here...given that, why is it allowed? Ie, are there reasons for not requiring the shape dimensions to be greater than 0? So that scripts can work transparently with arrays of all sizes: In [1]: import numpy as np In [3]: a = np.random.randn(5); b = a[a1]; print b.shape (1,) In [4]: a = np.random.randn(5); b = a[a1]; print b.shape (1,) In [5]: a = np.random.randn(5); b = a[a1]; print b.shape (0,) In [10]: b[:]=b[:]-1 The : just means all, so it's fine to use it if there aren't any (all of none). Basically, if this were not permitted there would be a zillion little corner cases code that was trying to be generic would have to deal with. There is a similar issue with zero-dimensional arrays for code that is trying to be generic for number of dimensions. That is, you want to be able to do something like: In [12]: a = a-np.mean(a,axis=-1)[...,np.newaxis] and have it work whether a is an n-dimensional array, in which case np.mean(a,axis=-1) is (n-1)-dimensional, or a is a one-dimensional array, in which case np.mean(a,axis=-1) is a zero-dimensional array, or maybe a scalar: In [13]: np.mean(a) Out[13]: 0.0 In [14]: 0.0[...,np.newaxis] --- TypeError Traceback (most recent call last) /homes/janeway/aarchiba/ipython console in module() TypeError: 'float' object is unsubscriptable In [15]: np.mean(a)[...,np.newaxis] Out[15]: array([ 0.]) Normalizing this stuff is still ongoing; it's tricky, because you often want something like np.mean(a) to be just a number, but generic code wants it to behave like a zero-dimensional array. Currently numpy supports both array scalars, that is, numbers of array dtypes, and zero-dimensional arrays; they behave mostly alike, but there are a few inconsistencies (and it's arguably redundant to have both). That said, it is often far easier to write generic code by flattening the input array, dealing with it as a guaranteed-one-dimensional array, then reconstructing the correct shape at the end, but such code is kind of a pain to write. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] New ufuncs
On 05/11/2008, Charles R Harris [EMAIL PROTECTED] wrote: On Tue, Nov 4, 2008 at 11:05 PM, T J [EMAIL PROTECTED] wrote: On Tue, Nov 4, 2008 at 9:37 PM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/11/5 Charles R Harris [EMAIL PROTECTED]: Hi All, I'm thinking of adding some new ufuncs. Some possibilities are expadd(a,b) = exp(a) + exp(b) -- For numbers stored as logs: Surely this should be log(exp(a)+exp(b))? That would be extremely useful, yes. +1 But shouldn't it be called 'logadd', for adding values which are stored as logs? Hmm... but I'm thinking one has to be clever here because the main reason I heard for using logs was that normal floating point numbers had insufficient range. So maybe something like logadd(a,b) = a + log(1 + exp(b - a)) where a b ? That's the usual way to do it, yes. I'd use log1p(exp(b-a)) for a little extra accuracy, though it probably doesn't matter. And yes, using logadd.reduce() is not the most efficient way to get a logsum(); no reason it can't be a separate function. As T J says, a logdot() would come in handy too. A python implementation is a decent first pass, but logdot() in particular would benefit from a C implementation. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
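A plain-python sketch of such a logadd (just a vectorized helper, not the proposed C ufunc; later numpy releases grew np.logaddexp, which does the same thing in compiled code):

import numpy as np

def logadd(a, b):
    # log(exp(a) + exp(b)) without ever forming the exponentials at full size
    big = np.maximum(a, b)
    small = np.minimum(a, b)
    return big + np.log1p(np.exp(small - big))

print(logadd(1000.0, 1000.0))   # ~1000.6931; exp(1000) alone would overflow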
Re: [Numpy-discussion] How to do: y[yT] = y+T
If what you are trying to do is actually ensure all data is within the range [a,b], you may be interested to know that python's % operator works on floating-point numbers: In [1]: -0.1 % 1 Out[1]: 0.90002 So if you want all samples in the range (0,1) you can just do y%=1. Anne 2008/10/27 Nicolas ROUX [EMAIL PROTECTED]: Thanks for all of you, for this fast and good reply ;-) Nicolas. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Douard Sent: Monday, October 27, 2008 12:51 PM To: numpy-discussion@scipy.org Subject: Re: [Numpy-discussion] How to do: y[yT] = y+T Le Monday 27 October 2008 12:41:06 Nicolas ROUX, vous avez écrit : Hello, I hope this is not a silly question ;-) I have a Numpy array, and I want to process it with : if the value is lower than Threshold, then increase by Threshold I would like to translate it as: y[yTreshold] = y + Treshold let's see : y[yT] += T Is it what you want ? To benefit from the Numpy speed. But this doesn't work, any idea ? Thanks, Cheers, Nicolas. ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion -- David DouardLOGILAB, Paris (France), +33 1 45 32 03 12 Formations Python, Zope, Debian : http://www.logilab.fr/formations Développement logiciel sur mesure : http://www.logilab.fr/services Informatique scientifique : http://www.logilab.fr/science ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/9 David Bolme [EMAIL PROTECTED]: I have written up basic nearest neighbor algorithm. It does a brute force search so it will be slower than kdtrees as the number of points gets large. It should however work well for high dimensional data. I have also added the option for user defined distance measures. The user can set a default p. p has the same functionality if it is a float. p can also be a function that computes a distance matrix or the measure can be selected using the strings: Manhattan, Euclidean, or Correlation. https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py This is interesting. I would point out, though, that if you want a Minkowski norm, it may be more efficient (that is, faster) to use the new compiled kd-tree implementation with leafsize set to the size of your data. This is written in compiled code and uses short-circuit distance evaluation, and may be much faster for high-dimensional problems. Given that, this should perhaps go in with other generic metric space code. I have a functional implementation of ball trees (though I don't know how efficient they are), and am looking into implementing cover trees. The interface is similar to Anne's code and in many cases can be used as a drop in replacement. I have posted the code to my own project because I have a short term need and I do not have access to the scipy repository. Feel free to include the code with scipy under the scipy license. I did find a typo your documentation. typo trie - tree - ... kd-tree is a binary trie, each of whose ... That's not actually a typo: a trie is a tree in which all the data is stored at leaf nodes. Basic kd-trees use the nodes themselves to define splitting planes; you can actually construct one with no extra storage at all, just by choosing an appropriate order for your array of points. This implementation chooses splitting planes that may not pass through any point, so all the points get stored in leaves. Also I found the use of k in the documentation some what confusing as it is the dimensionality of the data points in the kd-tree and the number of neighbors for k-nearest neighbors. That's a good point. I changed the dimension of the kd-tree to m throughout. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Data types in Numerical Python
2008/10/12 Linda Seltzer [EMAIL PROTECTED]: Here is an example that works for any working numpy installation: import numpy as npy npy.zeros((256, 256)) This suggestion from David did work so far, and removing the other import line enabled the program to run. However, the data types the program used as defaults for variables has changed, and now I am getting error messages about data types. It seems that some variables are getting a default designation as floats. Before I installed numpy and needed 2-D arrays, the program was working with the default types, and I did not have to specify types. Is there a clear tutorial that describes a means to assign data types for each variable as in C, so that I don't obtain error messages about data types? Python is a dynamically-typed language (unlike C), so variables do not have type. That is, a variable can refer to an object of any type; if you need to know what type of object a variable currently refers to you must inspect the object. You may want to go through one of the brief introduction-to-python tutorials that are on the python website just to get comfortable with the language. (For example, understanding the meaning and syntax of import statements.) When you create a numpy array, you can specify its type. You can also explicitly or implicitly convert the types of numpy arrays. I recommend you select a data type, let's say np.uint32, and make sure various arrays are created containing that type: np.zeros((m,n),dtype=np.uint32) np.arange(10,dtype=np.uint32) x.astype(np.uint32) np.array([1,2,3,4.5], dtype=np.uint32) et cetera. Most operations (addition, multiplication, maximum) will preserve the data type of arrays they are given (but if you supply two different data types numpy will attempt to choose the larger). Because I am simulating code for a DSP processor, the data types I need are unsigned bytes, unsigned 32-bit ints, and signed 32-bit ints. In some cases I can use unsigned and signed 16-bit ints. Also, what data types are valid for use with local operations such as exclusive or? The numpy data types you want are described by dtype objects. These can in principle become quite complicated, but the ones you need are given names already: np.uint8 np.uint32 np.int32 np.uint16 np.int16 You can specify any of these as dtype= arguments to the various numpy functions. If you need really rigid typing, python may not be the ideal language for you. Good luck, Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Vectorizing dot on the two last axis
2008/10/10 Gael Varoquaux [EMAIL PROTECTED]: I have been unable to vectorize the following operation:: window_size = 10 nb_windows = 819 nb_clusters = 501 restricted_series = np.random.random(size=(window_size, nb_clusters, nb_windows)) this_cov = np.zeros((nb_windows, nb_clusters, nb_clusters)) print sys.stderr, alive for index, window in enumerate(restricted_series): this_cov[index, ...] = np.dot(window, window.T) The last for loop is the one I am unhappy with. Note that it is fairly easy to get huge arrays when trying the vectorize this through np.dot or np.tensordot. I am unhappy with myself: I feel that it should be possible to vectorize this operation, but I cannot figure it out. I am pretty sure that, unfortunately, there is no way to vectorize this without an intermediate array substantially larger than either the inputs or the outputs. (Add one to the tally of people wishing for ufunc-like linear algebra.) From a computational point of view, this isn't particularly a problem: the intermediate array cannot be large enough to be a problem without the slices along at least one axis being large enough that for loop overhead is pretty minor. So you can just loop over this axis. Coding-wise, of course, this is a royal pain: unless you know ahead of time which axis is the smallest, you need to code several versions of your routine, and each version must include a for loop (just what numpy is supposed to help avoid). So: ufunc-like linear algebra would be great, but in the meantime, I guess we all learn to live with for loops. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] collecting the bluest pixels
2008/10/7 paul taney [EMAIL PROTECTED]: Hi, I have this silly color filter that Stefan gave me: def vanderwalt(image, f): """colorfilter, thanks to Stefan van der Walt""" RED, GRN, BLU = 0, 1, 2 bluemask = (image[...,BLU] > f*image[...,GRN]) & (image[...,BLU] > f*image[...,RED]) return bluemask To collect the right number of the bluest pixels I am calling it from this arduous successive approximation routine. It occurred to me that someone on this list knows how to do this in a couple of lines... Well, I can see several approaches. The most direct way to do what you're asking is to use scipy.optimize.bisect to implement the successive approximations. That'll be almost as slow as your current approach, though. Instead, I'd first write a function that measures the blueness of each pixel: def blueness(image): A = np.empty(image.shape[:-1],dtype=np.float32) np.divide(image[...,BLU],image[...,RED],A) # use three-argument divide to reduce the number of float temporaries B = np.empty(image.shape[:-1],dtype=np.float32) np.divide(image[...,BLU],image[...,GRN],B) return np.minimum(A,B) Now, once you have the bluenesses, you can sort them and pull out the blueness that gives you the percent you want: bluenesses = np.sort(blueness(image),axis=None) # axis=None flattens the array factor = bluenesses[int((1-wanted_fraction)*len(bluenesses))] If you want a smarter blueness filter, you could look into using HSV: you could then specify a distance in hue and a distance in saturation, based on how relatively important you think they are. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
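As a follow-up sketch (assuming the blueness() function, image, wanted_fraction and factor defined above), the mask of the bluest pixels then comes straight from the same measure:

import numpy as np
# assumes blueness(), image, wanted_fraction and factor as defined in the post above
b = blueness(image)
bluemask = b >= factor        # True for roughly the bluest wanted_fraction of pixels
n_selected = bluemask.sum()   # should be close to wanted_fraction * bluemask.size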
Re: [Numpy-discussion] Merge of generalised ufuncs branch
2008/10/7 Stéfan van der Walt [EMAIL PROTECTED]: The generalised ufuncs branch was made available before SciPy'08. We solicited comments on its implementation and structuring, but received very little feedback. Unless there are any further comments from the community, I propose that we merge it. Sounds good to me - I've counted at least three or four threads on the mailing lists wishing for the ufuncized linear algebra this would allow since it was put forward. (Of course, these won't appear until someone implements them - perhaps a start would be for someone to write a tutorial on using the new generalized ufunc code...) It is unfortunate that we have so many patches waiting for review (SciPy suffers worst, I'm afraid); clearly there are too few hours in a day. Nothing discourages contributions as much as a stale project, and I hope we can avoid that. The problem may perhaps be that not many people feel they are in a position to actually do the reviews, so that everyone is waiting on an imagined few real developers to place the Official Stamp of Approval. Perhaps an informal rule of thumb for acceptance of patches? How about: posted to list, reviewed by someone not the author who's had at least one patch accepted before, and no objections on the list? Anything that receives objections can fall through to the usual decision by discussion on the mailing list, this is just intended for those patches that everyone just kind of shrugs and says looks all right to me. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] collecting the bluest pixels
2008/10/7 Christopher Barker [EMAIL PROTECTED]: I wonder if the Euclidean norm would make sense for this application: HowFarFromBlue = np.sqrt((255-image[...,BLU])**2 + image[...,GRN]**2 + image[...,RED]**2) smaller numbers would be bluest -- pure blue would be 0, pure red 360, etc... One thing I like about this is that your blue may not exactly be an RGB blue -- so you could see how far a given pixel was from any given color -- in your case, whatever your blue is. Then it would be something like: r, g, b = ref_color HowFarFromRefColor = np.sqrt((r - image[...,RED])**2 + (g - image[...,GRN])**2 + (b - image[...,BLU])**2 ) NOTE: I know nothing of image processing -- I'll bet there are some standard ways to determine how close two colors are. It's a tricky problem, but if you're serious about it you can use Euclidean distance in the CIELUV colour space, or if you're really serious Euclidean distance in CIELAB. Both probably overkill for this project. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/3 David Bolme [EMAIL PROTECTED]: I remember reading a paper or book that stated that for data that has been normalized correlation and Euclidean are equivalent and will produce the same knn results. To this end I spent a couple hours this afternoon doing the math. This document is the result. http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf Yes, you're right, even without the mean subtraction they all lie on a hypersphere in Euclidean space. It's a little more awkward if you want to identify x and -x. I believe that with mean subtracted and unit length vectors, a Euclidean knn algorithm will produce the same result as if the vectors were compared using correlation. I am not sure if kd-trees will perform well on the normalized vectors as they have a very specific geometry. If my math checks out it may be worth adding Pearson's correlation as a default option or as a separate class. Actually it's probably easier if the user just does the prenormalization. I have also spent a little time looking at kd-trees and the kdtree code. It looks good. As I understand it kd-trees only work well when the number of datapoints (N) is larger than 2^D, where D is the dimensionality of those points. This will not work well for many of my computer vision problems because often D is large. As Anne suggested I will probably look at cover trees because often times the data are low-dimensional data in high-dimensional spaces. I have been told though that with a large D there is no known fast algorithm for knn. Pretty much true. Though if the intrinsic dimensionality is low, cover trees should be all right. Another problem is that the distances and similarity measures used in biometrics and computer vision are often very specialized and may or may not conform to the underlying assumptions of fast algorithms. I think for this reason I will need an exhaustive search algorithm. I will code it up modeled after Anne's interface and hopefully it will make it into the spatial module. Metric spaces are quite general - edit distance for strings, for example, is an adequate distance measure. But brute-force is definitely worth having too. If I get the test suite cleaned up, it should be possible to just drop an arbitrary k-nearest-neighbors class into it and get a reasonably thorough collection of unit tests. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
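For reference, a small sketch (mine, not from the paper or the post) of the prenormalization being discussed; after it, Euclidean nearest neighbors coincide with highest-correlation neighbors:

import numpy as np

def prenormalize(X):
    # X holds one vector per row (assumed non-constant); subtract each row's
    # mean and scale to unit length. For such rows,
    # ||u - v||**2 = 2*(1 - correlation(u, v)), so Euclidean nearest neighbors
    # are exactly the highest-correlation neighbors.
    Xc = X - X.mean(axis=1)[:, np.newaxis]
    norms = np.sqrt((Xc * Xc).sum(axis=1))
    return Xc / norms[:, np.newaxis]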
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/2 David Bolme [EMAIL PROTECTED]: It may be useful to have an interface that handles both cases: similarity and dissimilarity. Often I have seen Nearest Neighbor algorithms that look for maximum similarity instead of minimum distance. In my field (biometrics) we often deal with very specialized distance or similarity measures. I would like to see support for user defined distance and similarity functions. It should be easy to implement by passing a function object to the KNN class. I am not sure if kd-trees or other fast algorithms are compatible with similarities or non-euclidian norms, however I would be willing to implement an exhaustive search KNN that would support user defined functions. kd-trees can only work for distance measures which have certain special properties (in particular, you have to be able to bound them based on coordinate differences). This is just fine for all the Minkowski p-norms (so in particular, Euclidean distance, maximum-coordinate-difference, and Manhattan distance) and in fact the current implementation already supports all of these. I don't think that correlation can be made into such a distance measure - the neighborhoods are the wrong shape. In fact the basic space is projective n-1 space rather than affine n-space, so I think you're going to need some very different algorithm. If you make a metric space out of it - define d(A,B) to be the angle between A and B - then cover trees can serve as a spatial data structure for nearest-neighbor search. Cover trees may be worth implementing, as they're a very generic data structure, suitable for (among other things) low-dimensional data in high-dimensional spaces. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/1 Gael Varoquaux [EMAIL PROTECTED]: On Tue, Sep 30, 2008 at 06:10:46PM -0400, Anne Archibald wrote: k=None in the third call to T.query seems redundant. It should be possible to put some logic so that the call is simply distances, indices = T.query(xs, distance_upper_bound=1.0) Well, the problem with this is that you often want to provide a distance upper bound as well as a number of nearest neighbors. Absolutely. I just think k should default to None when distance_upper_bound is specified. k=None could be interpreted as k=1 when distance_upper_bound is not specified. That seems very confusing. Better perhaps to have a function query_all_neighbors, even if it's just a wrapper. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/1 Barry Wark [EMAIL PROTECTED]: Thanks for taking this on. The scikits.ann has licensing issues (as noted above), so it would be nice to have a clean-room implementation in scipy. I am happy to port the scikits.ann API to the final API that you choose, however, if you think that would be helpful. That's a nice idea. I'm not totally sure yet how much it's going to be possible for different implementations to be plug-in replacements, but it sure would be nice if users could use ANN transparently. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Upper and lower envelopes
2008/9/30 bevan [EMAIL PROTECTED]: Hello, I have some XY data. I would like to generate the equations for an upper and lower envelope that excludes a percentage of the data points. I would like to define the slope of the envelope line (say 3) and then have my code find the intercept that fits my requirements (say 5% of data below the lower envelope). This would then give me the equation and I could plot the upper and lower envelopes. I hope this makes sense. Thanks for any help. For this particular problem - where you know the slope - it's not too hard. If the slope is b, and your points are x and y, compute y-b*x, then sort that array, and choose the 5th and 95th percentile values. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
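A minimal sketch of that recipe, with placeholder data (x and y stand in for your own arrays, and b is the chosen slope):

import numpy as np

x = np.random.random(1000)                  # placeholder data; substitute your own x, y
y = 3.0 * x + np.random.normal(size=1000)
b = 3.0                                     # the chosen slope of the envelope lines

resid = np.sort(y - b * x)                  # the intercept each point would need
lower = resid[int(0.05 * len(resid))]       # ~5% of points fall below y = b*x + lower
upper = resid[int(0.95 * len(resid))]       # ~5% of points lie above y = b*x + upper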
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/9/30 Peter [EMAIL PROTECTED]: On Tue, Sep 30, 2008 at 5:10 AM, Christopher Barker [EMAIL PROTECTED] wrote: Anne Archibald wrote: I suggest the creation of a new submodule of scipy, scipy.spatial, +1 Here's one to consider: http://pypi.python.org/pypi/Rtree and perhaps other stuff from: http://trac.gispython.org/spatialindex/wiki which I think is LGPL -- can scipy use that? There is also a KDTree module in Biopython (which is under a BSD/MIT style licence), http://biopython.org/SRC/biopython/Bio/KDTree/ The current version is in C, there is an older version available in the CVS history in C++ too, http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/KDTree/?cvsroot=biopython I think the profusion of different implementations is an argument for including this in scipy. I think it is also an argument for providing a standard interface with (at least potentially) several different implementations. At the moment, that proposed interface looks like: T = KDTree(data) distances, indices = T.query(xs) # single nearest neighbor distances, indices = T.query(xs, k=10) # ten nearest neighbors distances, indices = T.query(xs, k=None, distance_upper_bound=1.0) # all within 1 of x In the first two cases, missing neighbors are represented with an infinite distance and an invalid index. In the last case, distances and indices are both either lists (if there's only one query point) or object arrays of lists (if there are many query points). If only one neighbor is requested, the array does not have a dimension of length 1 in the which-neighbor position. If (potentially) many neighbors are returned, they are sorted by distance, nearest first. What do you think of this interface? It may make sense to provide additional kinds of query - nearest neighbors between two trees, for example - some of which would be available only for some implementations. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/9/30 Gael Varoquaux [EMAIL PROTECTED]: On Tue, Sep 30, 2008 at 05:31:17PM -0400, Anne Archibald wrote: T = KDTree(data) distances, indices = T.query(xs) # single nearest neighbor distances, indices = T.query(xs, k=10) # ten nearest neighbors distances, indices = T.query(xs, k=None, distance_upper_bound=1.0) # all within 1 of x What do you think of this interface? k=None in the third call to T.query seems redundant. It should be possible to put some logic so that the call is simply distances, indices = T.query(xs, distance_upper_bound=1.0) Well, the problem with this is that you often want to provide a distance upper bound as well as a number of nearest neighbors. For example, suppose you are trying to find the point of closest approach of a path to the set of points stored in a kd-tree. You do the first query normally, but then since the second point is close to the first, you can accelerate the search dramatically by providing an upper bound of the distance from the first point's nearest neighbor to the second point. With a little luck, this will save you from ever having to visit much of the tree. It is also possible, of course, to provide wrappers to query() that do just one thing; I had this in mind more for fiddling the output (returning only the distance to the nearest neighbor, for instance). Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Proposal: scipy.spatial
Hi, Once again there has been a thread on the numpy/scipy mailing lists requesting (essentially) some form of spatial data structure. Pointers have been posted to ANN (sadly LGPLed and in C++) as well as a handful of pure-python implementations of kd-trees. I suggest the creation of a new submodule of scipy, scipy.spatial, to contain spatial data structures and algorithms. Specifically, I propose it contain a kd-tree implementation providing nearest-neighbor, approximate nearest-neighbor, and all-points-near queries. I have a few other suggestions for things it might contain, but kd-trees seem like a good first step. 2008/9/27 Nathan Bell [EMAIL PROTECTED]: On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald [EMAIL PROTECTED] wrote: I think a kd-tree implementation would be a valuable addition to scipy, perhaps in a submodule scipy.spatial that might eventually contain other spatial data structures and algorithms. What do you think? Should we have one? Should it be based on Sturla Molden's code, if the license permits? I am willing to contribute one, if not. +1 Judging that your vote and mine are enough in the absence of dissenting voices, I have written an implementation based on yours, Sturla Molden's, and the algorithms described by the authors of the ANN library. Before integrating it into scipy, though, I'd like to send it around for comments. Particular issues: * It's pure python right now, but with some effort it could be partially or completely cythonized. This is probably a good idea in the long run. In the meantime I have crudely vectorized it so that users can at least avoid looping in their own code. * It is able to return the r nearest neighbors, optionally subject to a maximum distance limit, as well as approximate nearest neighbors. * It is not currently set up to conveniently return all points within some fixed distance of the target point, but this could easily be added. * It returns distances and indices of nearest neighbors in the original array. * The testing code is, frankly, a mess. I need to look into using nose in a sensible fashion. * The license is the scipy license. I am particularly concerned about providing a convenient return format. The natural return from a query is a list of neighbors, since it may have variable length (there may not be r elements in the tree, or you might have supplied a maximum distance which doesn't contain r points). For a single query, it's simple to return a python list (should it be sorted? currently it's a heap). But if you want to vectorize the process, storing the results in an array becomes cumbersome. One option is an object array full of lists; another, which, I currently use, is an array with nonexistent points marked with an infinite distance and an invalid index. A third would be to return masked arrays. How do you recommend handling this variable (or potentially-variable) sized output? If you're implementing one, I would highly recommend the left-balanced kd-tree. http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf Research suggests that at least in high dimension a more geometric balancing criterion can produce vastly better results. I used the sliding midpoint rule, which doesn't allow such a numerical balancing but does ensure that you don't have too many long skinny cells (since a sphere tends to cut very many of these). I've also been thinking about what else would go in scipy.spatial. 
I think it would be valuable to have a reasonably efficient distance matrix function (we seem to get that question a lot, and the answer's not trivial) as well as a sparse distance matrix function based on the kd-trees. The module could potentially also swallow the (currently sandboxed?) delaunay code. Anne

# Copyright Anne M. Archibald 2008
# Released under the scipy license
import numpy as np
from heapq import heappush, heappop

def distance_p(x,y,p=2):
    if p==np.inf:
        return np.amax(np.abs(y-x),axis=-1)
    elif p==1:
        return np.sum(np.abs(y-x),axis=-1)
    else:
        return np.sum(np.abs(y-x)**p,axis=-1)

def distance(x,y,p=2):
    if p==np.inf or p==1:
        return distance_p(x,y,p)
    else:
        return distance_p(x,y,p)**(1./p)

class KDTree(object):
    """kd-tree for quick nearest-neighbor lookup

    This class provides an index into a set of k-dimensional points which
    can be used to rapidly look up the nearest neighbors of any point.

    The algorithm used is described in Maneewongvatana and Mount 1999.
    The general idea is that the kd-tree is a binary trie, each of whose
    nodes represents an axis-aligned hyperrectangle. Each node specifies
    an axis and splits the set of points based on whether their coordinate
    along that axis is greater than or less than a particular value.

    During construction, the axis and splitting point are chosen by the
    sliding midpoint rule, which ensures that the cells do not all become
    long and thin. The tree
Re: [Numpy-discussion] Are there command similar as Matlab find command?
2008/9/30 frank wang [EMAIL PROTECTED]: I really do not know the difference between debug mode and the pdb debugger. To me, it seems that pdb is the only way to debug the python code. How do the experts of numpy/python debug their code? Are there any more efficient ways to debug code in the python world? I used to use matlab which comes with a nice debugger. But now I really want to try the open source software and I found python/numpy is a nice tool. The only lagging piece of python/numpy compared with matlab is the powerful debugging capability. I think you will find that the debugger, pdb, is a good analog to matlab's debugging mode. The way I usually debug a piece of code looks something like: [in window A] edit foomodule.py to add a function frobnicate() [in window B, which has a running ipython session] import foomodule reload(foomodule) frobnicate() some exception happens %pdb frobnicate() exception happens again (pdb) print some_variable 47 [in window A] wait a minute, some_variable should be 13 at this point... edit foomodule.py to fix the bug [in window B] reload(foomodule) frobnicate() 21 Alternatively, I write a test_foomodule.py, which contains a suite of unit tests. I run these tests and get dropped into a debugger when one fails, and I try to track it down. The main problem you ran into is that the debugger is not just any old python prompt. It has its own commands. It also, confusingly, will accept python commands, as long as they don't conflict with an internal command. I got into the habit of, when I wanted to run some python thing, writing (pdb) p 17+13 rather than just (pdb) 17+13 If you had simply written (pdb) p where(a>13) everything would have been fine. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] nonuniform scatter operations
2008/9/28 Geoffrey Irving [EMAIL PROTECTED]: Is there an efficient way to implement a nonuniform gather operation in numpy? Specifically, I want to do something like n,m = 100,1000 X = random.uniform(size=n) K = random.randint(n, size=m) Y = random.uniform(size=m) for k,y in zip(K,Y): X[k] += y but I want it to be fast. The naive attempt X[K] += Y does not work, since the slice assumes the indices don't repeat. I believe histogram can be persuaded to do this. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
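Concretely, here is one sketch of that accumulate-by-index idea using np.bincount with weights (the same behaviour the histogram suggestion relies on); sizes are the ones from the question:

import numpy as np

n, m = 100, 1000
X = np.random.uniform(size=n)
K = np.random.randint(n, size=m)
Y = np.random.uniform(size=m)

# accumulate Y into X at the (possibly repeated) indices K
sums = np.bincount(K, weights=Y)    # sums[k] is the total of Y where K == k
X[:len(sums)] += sums               # bincount's result only extends to K.max()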
Re: [Numpy-discussion] Minimum distance between 2 paths in 3D
2008/9/27 Andrea Gavana [EMAIL PROTECTED]: I was wondering if someone had any suggestions/references/snippets of code on how to find the minimum distance between 2 paths in 3D. Basically, for every path, I have I series of points (x, y, z) and I would like to know if there is a better way, other than comparing point by point the 2 paths, to find the minimum distance between them. If you treat the points as simply two clouds of points (that is, ignore the fact that they represent points on paths), spatial data structures can make this kind of question faster. For example a kd-tree can be constructed in something like O(n log n) time, and you can answer questions like what is the closest point in this set to (a,b,c)? in something like O(log n) time. So you could do this by putting one set of points in a kd-tree and just running queries against each point in the other. There exist other spatial data structures - octrees, voxel grids, bounding-volume hierarchies, other binary space partitioning trees, as well as various more specialized ones. As the number of dimensions becomes large they become substantially more difficult to work with, and you do need to balance construction time against lookup time, but they are definitely worth thinking about in your problem. Sadly, no such data structure exists in either numpy or scipy, though biopython apparently has an implementation of kd-trees. Good luck, Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
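For readers with a scipy recent enough to include scipy.spatial (which did not exist when this was written), the point-cloud version of the problem is only a few lines; this is a sketch, not code from the thread:

import numpy as np
from scipy.spatial import cKDTree   # assumes a scipy that ships scipy.spatial

path_a = np.random.random((10000, 3))   # placeholder point clouds, one point per row
path_b = np.random.random((20000, 3))

tree = cKDTree(path_b)
dists, nearest = tree.query(path_a)     # nearest point in path_b for every point of path_a
i = dists.argmin()
# dists[i] is the minimum distance; the closest pair is path_a[i] and path_b[nearest[i]]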
Re: [Numpy-discussion] Minimum distance between 2 paths in 3D
2008/9/27 Robert Kern [EMAIL PROTECTED]: On Sat, Sep 27, 2008 at 15:23, Anne Archibald [EMAIL PROTECTED] wrote: 2008/9/27 Andrea Gavana [EMAIL PROTECTED]: I was wondering if someone had any suggestions/references/snippets of code on how to find the minimum distance between 2 paths in 3D. Basically, for every path, I have I series of points (x, y, z) and I would like to know if there is a better way, other than comparing point by point the 2 paths, to find the minimum distance between them. If you treat the points as simply two clouds of points (that is, ignore the fact that they represent points on paths), spatial data structures can make this kind of question faster. For example a kd-tree can be constructed in something like O(n log n) time, and you can answer questions like what is the closest point in this set to (a,b,c)? in something like O(log n) time. So you could do this by putting one set of points in a kd-tree and just running queries against each point in the other. There exist other spatial data structures - octrees, voxel grids, bounding-volume hierarchies, other binary space partitioning trees, as well as various more specialized ones. As the number of dimensions becomes large they become substantially more difficult to work with, and you do need to balance construction time against lookup time, but they are definitely worth thinking about in your problem. Sadly, no such data structure exists in either numpy or scipy, though biopython apparently has an implementation of kd-trees. Sturla Molden has just contributed a pure Python+numpy implementation on the Scipy Cookbook. http://www.scipy.org/Cookbook/KDTree I think a kd-tree implementation would be a valuable addition to scipy, perhaps in a submodule scipy.spatial that might eventually contain other spatial data structures and algorithms. What do you think? Should we have one? Should it be based on Sturla Molden's code, if the license permits? I am willing to contribute one, if not. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proper NaN handling in max and co: a redux
2008/9/26 David Cournapeau [EMAIL PROTECTED]: Charles R Harris wrote: I'm also wondering about the sign ufunc. It should probably return nan for nans, but -1,0,1 are the current values. We also need to decide which end of the sorted values the nans should go to. I'm a bit partial to the beginning but the end would be fine with me, it might even be a bit more natural as searchsorted would find the beginning of the nans by default. I would think sign should return NaN (does it not now?) unless its return type is integer, in which case I can't see a better answer than raising an exception (we certainly don't want it silently swallowing NaNs). Note that in both suggested approaches, sort with NaN would raise an error. We could then provide a nansort, which ignores the NaN and deals with your case; would it be hard to have an option for the beginning vs the end, or would that be difficult? Is it really a good idea to duplicate the maskedarray sorting code? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Standard functions (z-score) on nan (again)
2008/9/25 Peter Saffrey [EMAIL PROTECTED]: I've bodged my way through my median problems (see previous postings). Now I need to take a z-score of an array that might contain nans. At the moment, if the array, which is 7000 elements, contains 1 nan or more, all the results come out as nan. My other problem is that my array is indexed from somewhere else - position 0 holds the value for a particular id, position 1 holds the value for a particular id and so on, so if I remove the nans, all the indices are messed up. Can anybody suggest a sensible fix for this problem? I know I should really be using masked arrays, but I can't seem to find a masked-array version of zs. Is there one? Just keep an array of indices: c = ~np.isnan(A) Aclean = A[c] indices = np.arange(len(A))[c] Then the original index of Aclean[i] is indices[i]. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
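A small sketch of one way to carry this through to the z-scores themselves, keeping the original indexing (A stands in for the 7000-element data array, which must be a float array):

import numpy as np

A = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])   # stands in for the 7000-element array
c = ~np.isnan(A)
Aclean = A[c]

z = np.empty_like(A)
z.fill(np.nan)
z[c] = (Aclean - Aclean.mean()) / Aclean.std()
# z[i] is the z-score of A[i], and stays nan exactly where A[i] was missing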
Re: [Numpy-discussion] Medians that ignore values
2008/9/19 Eric Firing [EMAIL PROTECTED]: Pierre GM wrote: It seems to me that there are pragmatic reasons why people work with NaNs for missing values, that perhaps shd not be dismissed so quickly. But maybe I am overlooking a simple solution. nansomething solutions tend to be considerably faster, that might be one reason. A lack of visibility of numpy.ma could be a second. In any case, I can't but agree with other posters: a NaN in an array usually means something went astray. Additional reasons for using nans: 1) years of experience with Matlab, in which using nan for missing values is the standard idiom. Users are already retraining to use zero-based indexing; I don't think asking them to use a full-featured masked array package is an unreasonable retraining burden, particularly since this idiom breaks as soon as they want to work with arrays of integers or records. 2) convenient interfacing with extension code in C or C++. The latter is a factor in the present use of nan in matplotlib; using nan for missing values in an array passed into extension code saves having to pass and process a second (mask) array. It is fast and simple. How hard is it to pass an array where the masked values have been filled with nans? It's certainly easy to go the other way (mask all nans). I think this is less painful than supporting two differently-featured sets of functions for dealing with arrays containing some invalid values. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Medians that ignore values
2008/9/19 David Cournapeau [EMAIL PROTECTED]: Anne Archibald wrote: That was in amax/amin. Pretty much every other function that does comparisons needs to be fixed to work with nans. In some cases it's not even clear how: where should a sort put the nans in an array? In the case of median, the problem is more in how the functions use sort than in sort itself. There can't be a 'good' way to put nan in sort, for example, since nans cannot be ordered. Well, for example, you might ask that all the non-nan elements be in order, even if you don't specify where the nan goes. I don't know about the best strategy: either we fix every function using comparison, handling nan as a special case as you mentioned, or there may be a more clever thing to do to avoid special casing everywhere. I don't have a clear idea of how many functions rely on ordering in numpy. You can always just set numpy to raise an exception whenever it comes across a nan. In fact, apart from the difficulty of correctly frobbing numpy's floating-point handling, how reasonable is it for (say) median to just run as it is now, but if an exception is thrown, fall back to a nan-aware version? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Medians that ignore values
2008/9/19 Pierre GM [EMAIL PROTECTED]: On Friday 19 September 2008 03:11:05 David Cournapeau wrote: Hm, I am always puzzled when I think about nan handling :) It always seem there is not good answer. Which is why we have masked arrays, of course ;) I think the numpy attitude to nans should be that they are unexpected bogus values that signify that something went wrong with the calculation somewhere. They can be left in place for most operations, but any operation that depends on the value should (ideally) return nan, or failing that, raise an exception. (If users want exceptions all the time, that's what seterr is for.) If people want to flag bad data, let's tell them to use masked arrays. So by this rule amax/maximum/mean/median should all return nan when there's a nan in their input; I don't think it's reasonable for sort to return an array full of nans, so I think its default behaviour should be to raise an exception if there's a nan. It's valuable (for example in median) to be able to sort them all to the end, but I don't think this should be the default. If people want nanmin, I would be tempted to tell them to use masked arrays (is there a convenience function that makes a masked array with a mask everywhere the data is nan?). I am assuming that appropriate masked sort/amax/maximum/mean/median exist already. They're definitely needed, so how much effort is it worth putting in to duplicate that functionality with nans instead of masked elements? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
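On the parenthetical question: yes, a masked array that masks every nan can be built directly from an isnan mask; a one-line sketch (np.ma.masked_invalid is a convenience spelling of the same thing):

import numpy as np

a = np.array([1.0, np.nan, 3.0, 0.5])
am = np.ma.masked_array(a, mask=np.isnan(a))  # masked exactly where the data is nan
am.min()                                      # 0.5; the nan entry is simply ignored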
Re: [Numpy-discussion] Generating random samples without repeats
2008/9/19 Paul Moore [EMAIL PROTECTED]: Robert Kern robert.kern at gmail.com writes: On Thu, Sep 18, 2008 at 16:55, Paul Moore pf_moore at yahoo.co.uk wrote: I want to generate a series of random samples, to do simulations based on them. Essentially, I want to be able to produce a SAMPLESIZE * N matrix, where each row of N values consists of either 1. Integers between 1 and M (simulating M rolls of an N-sided die), or 2. A sample of N numbers between 1 and M without repeats (simulating deals of N cards from an M-card deck). Example (1) is easy, numpy.random.random_integers(1, M, (SAMPLESIZE, N)) But I can't find an obvious equivalent for (2). Am I missing something glaringly obvious? I'm using numpy - is there maybe something in scipy I should be looking at? numpy.array([(numpy.random.permutation(M) + 1)[:N] for i in range(SAMPLESIZE)]) Thanks. And yet, this takes over 70s and peaks at around 400M memory use, whereas the equivalent for (1) numpy.random.random_integers(1,M,(SAMPLESIZE,N)) takes less than half a second, and negligible working memory (both end up allocating an array of the same size, but your suggestion consumes temporary working memory - I suspect, but can't prove, that the time taken comes from memory allocations rather than computation). As a one-off cost initialising my data, it's not a disaster, but I anticipate using idioms like this later in my calculations as well, where the costs could hurt more. If I'm going to need to write C code, are there any good examples of this? (I guess the source for numpy.random is a good place to start). This was discussed on one of the mailing lists several months ago. It turns out that there is no simple way to efficiently choose without replacement in numpy/scipy. I posted a hack that does this somewhat efficiently (if SAMPLESIZE > M/2, choose the first SAMPLESIZE of a permutation; if SAMPLESIZE < M/2, choose with replacement and redraw any duplicates) but it's not vectorized across many sample sets. Is your problem large M or large N? What is SAMPLESIZE/M? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
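A rough reconstruction of the hack described above, for a single sample (this is illustrative only and, as noted, not vectorized across sample sets):

import numpy as np

def sample_without_replacement(M, N):
    # draw N distinct integers from 1..M (assumes N <= M); cheap when N is well below M
    if N > M // 2:
        # close to a full permutation anyway, so just permute and truncate
        return np.random.permutation(M)[:N] + 1
    draw = np.unique(np.random.random_integers(1, M, N))
    while len(draw) < N:
        extra = np.random.random_integers(1, M, N - len(draw))
        draw = np.unique(np.concatenate((draw, extra)))
    np.random.shuffle(draw)   # np.unique sorted the values, so reshuffle them
    return draw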
Re: [Numpy-discussion] Medians that ignore values
2008/9/19 David Cournapeau [EMAIL PROTECTED]: I guess my formulation was poor: I never use NaN as missing values because I never use missing values, which is why I wanted the opinion of people who use NaN in a different manner (because I don't have a good idea on how those people would like to see numpy behave). I was certainly not arguing they should not be use for the purpose of missing value. I, on the other hand, was making specifically that suggestion: users should not use nans to indicate missing values. Users should use masked arrays to indicate missing values. The problem with NaN is that you cannot mix the missing value behavior and the error behavior. Dealing with them in a consistent manner is difficult. Because numpy is a general numerical computation tool, I think that NaN should be propagated and never ignored *by default*. If you have NaN because of divide by 0, etc... it should not be ignored at all. But if you want it to ignore, then numpy should make it possible: - max, min: should return NaN if NaN is in the array, or maybe even fail ? - argmax, argmin ? - sort: should fail ? - mean, std, variance: should return Nan - median: should fail (to be consistent if sort fails) ? Should return NaN ? This part I pretty much agree with. We could then add an argument to failing functions to tell them either to ignore NaN/put them at some special location (like R does, for example). The ones I am not sure are median and argmax/argmin. For median, failing when sort does is consistent; but this can break a lot of code. For argmin/argmax, failing is the most logical, but OTOH, making argmin/argmax failing and not max/min is not consistent either. Breaking the code is maybe not that bad because currently, neither max/min nor argmax/argmin nor sort does return a meaningful function. Does that sound reasonable to you ? The problem with this approach is that all those decisions need to be made and all that code needs to be implemented for masked arrays. In fact I suspect that it has already been done in that case. So really what you are suggesting here is that we duplicate all this effort to implement the same functions for nans as we have for masked arrays. It's important, too, that the masked array implementation and the nan implementation behave the same way, or users will become badly confused. Who gets the task of keeping the two implementations in sync? The current situation is that numpy has two ways to indicate bad data for floating-point arrays: nans and masked arrays. We can't get rid of either: nans appear on their own, and masked arrays are the only way to mark bad data in non-floating-point arrays. We can try to make them behave the same, which will be a lot of work to provide redundant capabilities. Or we can make them behave drastically differently. Masked arrays clearly need to be able to handle masked values flexibly and explicitly. So I think nans should be handled simply and conservatively: propagate them if possible, raise if not. If users are concerned about performance, it's worth noting that on some machines nans force a fallback to software floating-point handling, with a corresponding very large performance hit. This includes some but not all x86 (and I think x86-64) CPUs. How this compares to the performance of masked arrays is not clear. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Medians that ignore values
2008/9/18 David Cournapeau [EMAIL PROTECTED]: Peter Saffrey wrote: Is this the correct behavior for median with nan? That's the expected behavior, at least :) (this is also the expected behavior of most math packages I know, including matlab and R, so this should not be too surprising if you have used those). I don't think I agree: In [4]: np.median([1,3,nan]) Out[4]: 3.0 In [5]: np.median([1,nan,3]) Out[5]: nan In [6]: np.median([nan,1,3]) Out[6]: 1.0 I think the expected behaviour would be for all of these to return nan. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] subsampling array without loops
2008/8/22 Catherine Moroney [EMAIL PROTECTED]: I'm looking for a way to acccomplish the following task without lots of loops involved, which are really slowing down my code. I have a 128x512 array which I want to break down into 2x2 squares. Then, for each 2x2 square I want to do some simple calculations such as finding the maximum value, which I then store in a 64x256 array. You should be able to do some of this with reshape and transpose: In [1]: import numpy as np In [3]: A = np.zeros((128,512)) In [4]: B = np.reshape(A,(64,2,256,2)) In [5]: C = np.transpose(B,(0,2,1,3)) In [6]: C.shape Out[6]: (64, 256, 2, 2) Now C[i,j] is the 2 by 2 array [ [A[2*i,2*j], A[2*i,2*j+1]], [A[2*i+1,2*j], A[2*i+1, 2*j+1]] ] (or maybe I got those indices the wrong way around.) Now most operations you want to do on the two-by-two matrices can be done, using ufuncs, on the whole array at once. For example if you wanted the max of each two-by-two matrix, you'd write: In [7]: D = np.amax(np.amax(C,axis=-1),axis=-1) In [8]: D.shape Out[8]: (64, 256) Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] doing zillions of 3x3 determinants fast
2008/8/25 Daniel Lenski [EMAIL PROTECTED]: On Mon, 25 Aug 2008 03:48:54 +, Daniel Lenski wrote: * it's fast enough for 100,000 determinants, but it bogs due to all the temporary arrays when I try to do 1,000,000 determinants (=72 MB array) I've managed to reduce the memory usage significantly by getting the number of temporary arrays down to exactly two: def det3(ar): a=ar[...,0,0]; b=ar[...,0,1]; c=ar[...,0,2] d=ar[...,1,0]; e=ar[...,1,1]; f=ar[...,1,2] g=ar[...,2,0]; h=ar[...,2,1]; i=ar[...,2,2] t=a.copy(); t*=e; t*=i; tot =t t=b.copy(); t*=f; t*=g; tot+=t t=c.copy(); t*=d; t*=h; tot+=t t=g.copy(); t*=e; t*=c; tot-=t t=h.copy(); t*=f; t*=a; tot-=t t=i.copy(); t*=d; t*=b; tot-=t return tot Now it runs very fast with 1,000,000 determinants to do (10X the time required to do 100,000) but I'm still worried about the stability with real-world data. As far as I know, that's the best you can do (though come to think of it, a 3D determinant is a cross-product followed by a dot product, so you might be able to cheat and use some built-in routines). It's a current limitation of numpy that there is not much support for doing many linear algebra problems. We do have perennial discussions about improving support for array-of-matrices calculations, and in fact currently in numpy SVN is code to allow C code doing a 3x3 determinant to be easily made into a generalized ufunc that does ufunc-like broadcasting and iteration over all but the last two axes. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
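For the record, a sketch of the cross-then-dot formulation hinted at above (det = r0 · (r1 × r2) for the three rows of each matrix), which broadcasts over any leading axes:

import numpy as np

def det3_crossdot(ar):
    # ar has shape (..., 3, 3); returns the determinant of each trailing 3x3 block
    r0, r1, r2 = ar[..., 0, :], ar[..., 1, :], ar[..., 2, :]
    return np.sum(r0 * np.cross(r1, r2), axis=-1)

# agrees with det3 above, e.g. on det3_crossdot(np.random.rand(1000000, 3, 3))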
Re: [Numpy-discussion] Generalized ufuncs?
2008/8/17 Robert Kern [EMAIL PROTECTED]: I suggested that we move it to a branch for the time being so we can play with it and come up with examples of its use. If you have examples that you have already written, I would love to see them. I, for one, am amenable to seeing this in 1.2.0, but only if we push back the release by at least a few weeks. This is not a worked example, but this is exactly what is needed to make possible the arrays of matrices functions that were discussed some time ago. For example, users frequently want to do something like multiply an array of matrices by an array of matrices; that is not currently feasible without a very large temporary array (or a loop). But I think you were looking for examples of code using the interface, to see whether it worked well. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NumPy 1.2.0b2 released
2008/8/15 Andrew Dalke [EMAIL PROTECTED]: On Aug 15, 2008, at 6:41 PM, Andrew Dalke wrote: I don't think it's enough. I don't like environmental variable tricks like that. My tests suggest: current SVN: 0.12 seconds my patch: 0.10 seconds removing some top-level imports: 0.09 seconds my patch and removing some additional top-level imports: 0.08 seconds (this is a guess) First, I reverted my patch, so my import times went from 0.10 second to 0.12 seconds. Turns out I didn't revert everything. As of the SVN version from 10 minutes ago, import numpy on my machine takes 0.18 seconds, not 0.12 seconds. My patch should cut the import time by about 30-40% more from what it is. On some machines. Your milage may vary :) I realize this is already a very complicated issue, but it's worth pointing out that the times you measure are not necessarily the times users care about. These numbers are once everything is loaded into disk cache. They don't reflect, say, interactive startup time, or time it takes in a script that uses substantial disk access (i.e. which fills the cache with something else). I realize this is the only available basis for comparison, but do keep in mind that improvements of a few milliseconds here may make a much larger difference in practice - or a much smaller difference. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
2008/8/14 Norbert Nemec [EMAIL PROTECTED]: Travis E. Oliphant wrote: NAN's don't play well with comparisons because comparison with them is undefined. See numpy.nanmin This is not true! Each single comparison with a NaN has a well defined outcome. The difficulty is only that certain logical assumptions do not hold any more when NaNs are involved (e.g. [A<B] is not equivalent to [not(A>=B)]). Assuming an IEEE compliant processor and C compiler, it should be possible to code a NaN-safe min routine without additional overhead. Sadly, it's not possible without extra overhead. Specifically: the NaN-ignorant implementation does a single comparison between each array element and a placeholder, and decides based on the result which to keep. If you try to rewrite the comparison to do the right thing when a NaN is involved, you get stuck: any comparison with a NaN on either side always returns False, so you cannot distinguish between the temporary being a NaN and the new element being a non-NaN (keep the temporary) and the temporary being a non-NaN and the new element being a NaN (replace the temporary). If you're willing to do two tests, sure, but that's overhead (and probably comparable to an isnan). If you're willing to do arithmetic you might even be able to pull it off, since NaNs tend to propagate: if (new<min) min -= (min-new); Whether the speed of this is worth its impenetrability I couldn't say. What do compilers' min builtins do with NaNs? This might well be faster than an if statement even in the absence of NaNs... Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
2008/8/12 Stéfan van der Walt [EMAIL PROTECTED]: Hi Andrew 2008/8/12 Andrew Dalke [EMAIL PROTECTED]: This is buggy for the case of a list containing only NaNs. import numpy as np np.NAN nan np.min([np.NAN]) nan np.nanmin([np.NAN]) inf Thanks for the report. This should be fixed in r5630. Er, is this actually a bug? I would instead consider the fact that np.min([]) raises an exception a bug of sorts - the identity of min is inf. Really nanmin of an array containing only nans should be the same as an empty array; both should be infinity. I guess this is a problem for types that don't have an infinity (returning maxint is a poor substitute), but what is the correct behaviour here? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
2008/8/12 Joe Harrington [EMAIL PROTECTED]: So, I endorse extending min() and all other statistical routines to handle NaNs, possibly with a switch to turn it on if a suitably fast algorithm cannot be found (which is competitor IDL's solution). Certainly without a switch the default behavior should be to return NaN, not to return some random value, if a NaN is present. Otherwise the user may never know a NaN is present, and therefore has to check every use for NaNs. That constant manual NaN checking is slower and more error-prone than any numerical speed advantage. So to sum, proposed for statistical routines: if NaN is not present, return value if NaN is present, return NaN if NaN is present and nan=True, return value ignoring all NaNs OR: if NaN is not present, return value if NaN is present, return value ignoring all NaNs if NaN is present and nan=True, return NaN I'd prefer the latter. IDL does the former and it is a pain to do /nan all the time. However, the latter might trip up the unwary, whereas the former never does. This would apply at least to: min max sum prod mean median std and possibly many others. For almost all of these the current behaviour is to propagate NaNs arithmetically. For example, the sum of anything with a NaN is NaN. I think this is perfectly sufficient, given how easy it is to strip out NaNs if that's what you want. The issue that started this thread (and the many other threads that have come up as users stub their toes on this behaviour) is that min (and other functions based on comparisons) do not propagate NaNs. If you do np.amin(A) and A contains NaNs, you can't count on getting a NaN back, unlike np.mean or np.std. The fact that you get some random value, not the minimum, just adds insult to injury. (It is probably also true that the value you get back depends on how the array is stored in memory.) It really isn't very hard to replace np.sum(A) with np.sum(A[~np.isnan(A)]) if you want to ignore NaNs instead of propagating them. So I don't feel a need for special code in sum() that treats NaN as 0. I would be content if the comparison-based functions propagated NaNs appropriately. If you did decide it was essential to make versions of the functions that removed NaNs, it would get you most of the way there to add an optional keyword argument to ufuncs' reduce method that skipped NaNs. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] confusion on importing numpy
2008/8/6 Eric Firing [EMAIL PROTECTED]: While I agree with the other posters that import * is not preferred, if you want to use it, the solution is to use amin and amax, which are provided precisely to avoid the conflict. Just as arange is a numpy analog of range, amin and amax are numpy analogs of min and max. There are actually several analogs of min and max, depending on what you want. np.minimum(a,b) takes the elementwise minimum, that is, np.minimum([1,2,3],[3,2,1]) is [1,2,1]. np.amin(a) finds the minimum value in the array a. (It's equivalent to np.minimum.reduce(a).) One reason automatically overriding min is a Bad Idea is that you will certainly shoot yourself in the foot: In [4]: numpy.min(2,1) Out[4]: 2 In [5]: min(2,1) Out[5]: 1 Note that this does not raise any kind of error, and it sometimes returns the right value. It's especially bad if you do, say, numpy.max(a,0) and expect the result to be always positive: now you get a bug in your program that only turns up with exceptional values. numpy.min is not the same as python's min, and using it as a substitute will get you in trouble. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
2008/7/15 Francesc Alted [EMAIL PROTECTED]: Maybe is only that. But by using the term 'frequency' I tend to think that you are expecting to have one entry (observation) in your array for each time 'tick' since time start. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp. Well, after reading the mails from Chris and Anne, I think the best is that the origin would be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before). A couple of details worth pointing out: we don't need a zillion resolutions. One that's as good as the world time standards, and one that spans an adequate length of time should cover it. After all, the only reason for not using the highest available resolution is if you want to cover a larger range of times. So there is no real need for microseconds and milliseconds and seconds and days and weeks and... There is also no need for the origin to be kept with a resolution as high as microseconds; seconds would do just fine, since if necessary it can be interpreted as exactly 7000 seconds after the epoch even if you are using femtoseconds elsewhere. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] NumPy date/time types and the resolution concept
2008/7/14 Francesc Alted [EMAIL PROTECTED]: After pondering about the opinions about the first proposal, the idea we are incubating is to complement the ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on a int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property. This is an interesting idea. To be useful, though, you would also need a flexible offset defining the zero of time. After all, the reason not to just always use (say) femtosecond accuracy is that 2**64 femtoseconds is only about five hours. So if you're going to use femtosecond steps, you really want to choose your start point carefully. (It's also worth noting that there is little need for more time accuracy than atomic clocks can provide, since anyone looking for more than that is going to be doing some tricky metrology anyway.) One might take guidance from the FITS format, which represents (arrays of) quantities as (usually) fixed-point numbers, but has a global scale and offset parameter for each array. This allows one to accurately represent many common arrays with relatively few bits. The FITS libraries transparently convert these quantities. Of course, this isn't so convenient if you don't have basic machine datatypes with enough precision to handle all the quantities of interest. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy
2008/7/11 Pierre GM [EMAIL PROTECTED]: A final note on time scales --- Wow, indeed. In environmental sciences (my side) and in finances (Matt's), we very rarely have a need for that precision, thankfully... We do, sometimes, in pulsar astronomy. But I think it's reasonable to force us to use our own custom time representations. For example, I've dealt with time series representing (X-ray) photon arrival times, expressed in modified Julian days. The effect of representing microsecond-scale quantities as differences between numbers on the order of fifty thousand days I leave to your imagination. But 64 bits was certainly not enough. More, there are even more subtle effects relating to differences between TAI and TT, issues relating to whether your times are measured at the earth's barycenter or the solar system barycenter... A date/time class that tries to do everything is quickly going to become unfeasibly complicated. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy
2008/7/11 Jon Wright [EMAIL PROTECTED]: Timezones are a heck of a problem if you want to be accurate. You are talking about nanosecond resolutions, however, atomic clocks in orbit apparently suffer from relativistic corrections of the order 38000 nanoseconds per day [1]. What will you do about data recorded on the international space station? Getting into time formats at this level seems to be rather complicated - there is no absolute time you can reference to - it is all relative :-) This particular issue turns up in pulsar timing; if you use X-ray observations, the satellite's orbit introduces all kinds of time variations. If you care about this you need to think about using (say) TAI, referenced to (say) the solar system barycenter (if there were no mass there). You can do all this, but I think it's out of scope for an ordinary date/time class. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] expected a single-segment buffer object
2008/7/9 Robert Kern [EMAIL PROTECTED]: Because that's just what a buffer= argument *is*. It is not a place for presenting the starting pointer to exotically-strided memory. Use __array_interface__s to describe the full range of representable memory. See below. Aha! Is this stuff documented somewhere? I was about a week ahead of you. See numpy/lib/stride_tricks.py in the trunk. Nice! Unfortunately it can't quite do what I want... for the linear algebra I need something that can broadcast all but certain axes. For example, take an array of matrices and an array of vectors. The array axes need broadcasting, but you can't broadcast on all axes without (incorrectly) turning the vector into a matrix. I've written a (messy) implementation, but the corner cases are giving me headaches. I'll let you know when I have something that works. Thanks, Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] expected a single-segment buffer object
2008/7/10 Charles R Harris [EMAIL PROTECTED]: On Thu, Jul 10, 2008 at 9:33 AM, Anne Archibald [EMAIL PROTECTED] wrote: 2008/7/9 Robert Kern [EMAIL PROTECTED]: Because that's just what a buffer= argument *is*. It is not a place for presenting the starting pointer to exotically-strided memory. Use __array_interface__s to describe the full range of representable memory. See below. Aha! Is this stuff documented somewhere? I was about a week ahead of you. See numpy/lib/stride_tricks.py in the trunk. Nice! Unfortunately it can't quite do what I want... for the linear algebra I need something that can broadcast all but certain axes. For example, take an array of matrices and an array of vectors. The array axes need broadcasting, but you can't broadcast on all axes without (incorrectly) turning the vector into a matrix. I've written a (messy) implementation, but the corner cases are giving me headaches. I'll let you know when I have something that works. I think something like a matrix/vector dtype would be another way to go for stacks of matrices and vectors. It would have to be a user defined type to fit into the current type hierarchy for ufuncs, but I think the base machinery is there with the generic inner loops. There's definitely room for discussion about how such a linear algebra system should ultimately work. In the short term, though, I think it's valuable to create a prototype system, even if much of the looping has to be in python, just to figure out how it should look to the user. For example, I'm not sure whether a matrix/vector dtype would make easy things like indexing operations - fancy indexing spanning both array and matrix axes, for example. dtypes can also be a little impenetrable to use, so even if this is how elementwise linear algebra is ultimately implemented, we may want to provide a different user frontend. My idea for a first sketch was simply to provide (for example) np.elementwise_linalg.dot_mm, that interprets its arguments both as arrays of matrices and yields an array of matrices. A second sketch might be a subclass MatrixArray, which could serve much as Matrix does now, to tell various functions which axes to do linear algebra over and which axes to iterate over. There's also the question of how to make these efficient; one could of course write a C wrapper for each linear algebra function that simply iterates, but your suggestion of using the ufunc machinery with a matrix dtype might be better. Or, if cython acquires sensible polymorphism over numpy dtypes, a cython wrapper might be the way to go. But I think establishing what it should look like to users is a good first step, and I think that's best done with sample implementations. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
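A rough, loop-in-Python prototype of what an elementwise_linalg.dot_mm might look like (the function name is the hypothetical one from the post; np.broadcast_shapes is from much later numpy), just to illustrate the intended user-facing behaviour:

    import numpy as np

    def dot_mm(a, b):
        # Treat a as an array of (n, m) matrices and b as an array of (m, p)
        # matrices; broadcast the leading axes and multiply each pair with np.dot.
        a, b = np.asarray(a), np.asarray(b)
        lead = np.broadcast_shapes(a.shape[:-2], b.shape[:-2])
        a_b = np.broadcast_to(a, lead + a.shape[-2:])
        b_b = np.broadcast_to(b, lead + b.shape[-2:])
        out = np.empty(lead + (a.shape[-2], b.shape[-1]), np.result_type(a, b))
        for idx in np.ndindex(*lead):
            out[idx] = np.dot(a_b[idx], b_b[idx])
        return out

The Python loop is slow, of course, but it pins down the broadcasting and indexing semantics before any C-level work.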
Re: [Numpy-discussion] huge array calculation speed
2008/7/10 Dan Lussier [EMAIL PROTECTED]: I am relatively new to numpy and am having trouble with the speed of a specific array based calculation that I'm trying to do. What I'm trying to do is to calculate the total potential energy and coordination number of each atom within a relatively large simulation. Each atom is at a position (x,y,z) given by a row in a large array (approximately 1e6 by 3) and presently I have no information about its nearest neighbours, so each position must be checked against all others before cutting the list down prior to calculating the energy. The way you've implemented this goes as the square of the number of atoms. This is going to absolutely kill performance, and you can spend weeks trying to optimize the code for a factor of two speedup. I would look hard for algorithms that do this in less than O(n^2). This problem of finding the neighbours of an object has seen substantial research, and there are a variety of well-established techniques covering many possible situations. My knowledge of it is far from up-to-date, but since you have only a three-dimensional problem, you could try a three-dimensional grid (if your atoms aren't too clumpy) or octrees (if they are somewhat clumpy); kd-trees are probably overkill (since they're designed for higher-dimensional problems). My current numpy code is below and in it I have tried to push as much of the work for this computation into the compiled C (ideally) of numpy. However it is still taking a very long time at approx 1 second per row. At this speed even some simple speed-ups like weave won't really help. Are there any other numpy ways that it would be possible to calculate this, or should I start looking at going to C/C++? I am also left wondering if there is something wrong with my installation of numpy (i.e. not making use of lapack/ATLAS). Other than that, if it is relevant - I am running 32 bit x86 Centos 5.1 linux on a dual Xeon 3.0 GHz Dell tower with 4 GB of memory. Unfortunately, implementing most of the algorithms in the literature within numpy is going to be fairly cumbersome. But I think there are better algorithms you could try: * Put all your atoms in a grid. Ideally the cells would be about the size of your cutoff radius, so that for each atom you need to pull out all atoms in at most eight cells for checking against the cutoff. I think this can be done fairly efficiently in numpy. * Sort the atoms along a coordinate and use searchsorted to pull out those for which that coordinate is within the cutoff. This should get you down to reasonably modest lists fairly quickly. There is of course a tradeoff between preprocessing time and lookup time. We seem to get quite a few posts from people wanting some kind of spatial data structure (whether they know it or not). Would it make sense to come up with some kind of compiled spatial data structure to include in scipy? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
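A rough sketch of the second suggestion (sort along one coordinate, then use searchsorted to narrow the candidates before the exact distance check); variable names are made up and nothing here is tuned:

    import numpy as np

    def neighbours_within(pos, cutoff):
        # pos: (N, 3) array of atom positions.  Returns, for each atom, the
        # indices of the other atoms lying within `cutoff`.
        order = np.argsort(pos[:, 0])
        x_sorted = pos[order, 0]
        out = []
        for i in range(len(pos)):
            lo = np.searchsorted(x_sorted, pos[i, 0] - cutoff, side='left')
            hi = np.searchsorted(x_sorted, pos[i, 0] + cutoff, side='right')
            cand = order[lo:hi]                           # close along x only
            d2 = ((pos[cand] - pos[i]) ** 2).sum(axis=1)  # exact squared distances
            out.append(cand[(d2 <= cutoff ** 2) & (cand != i)])
        return out

There is still a Python loop over atoms, but each iteration touches a short candidate list instead of all 1e6 rows, which is where the O(n^2) cost came from.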
Re: [Numpy-discussion] alterdot and restoredot
2008/7/9 Robert Kern [EMAIL PROTECTED]: - Which operations do the functions exactly affect? It seems that alterdot sets the dot function slot to a BLAS version, but what operations does this affect? dot(), vdot(), and innerproduct() on C-contiguous arrays which are Matrix-Matrix, Matrix-Vector or Vector-Vector products. Really? Not, say, tensordot()? Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] element-wise logical operations on numpy arrays
2008/7/9 Catherine Moroney [EMAIL PROTECTED]: I have a question about performing element-wise logical operations on numpy arrays. If a, b and c are numpy arrays of the same size, does the following syntax work? mask = (a > 1.0) & ((b > 3.0) | (c > 10.0)) It seems to be performing correctly, but the documentation that I've read indicates that & and | are for bitwise operations, not element-by-element operations on arrays. I'm trying to avoid using logical_and and logical_or because they make the code more cumbersome and difficult to read. Are & and | acceptable substitutes for numpy arrays? Yes. Unfortunately it is impossible to make python's usual logical operators, and, or, etcetera, behave correctly on numpy arrays. So the decision was made to use the bitwise operators to express logical operations on boolean arrays. If you like, you can think of boolean arrays as containing single bits, so that the bitwise operators *are* the logical operators. Confusing, but I'm afraid there really isn't anything the numpy developers can do about it, besides write good documentation. Good luck, Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
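A small illustration (thresholds chosen arbitrarily) of why & works on boolean arrays where Python's `and` does not:

    import numpy as np

    a = np.array([0.5, 1.5, 2.5])
    b = np.array([4.0, 5.0, 0.5])

    mask = (a > 1.0) & (b > 3.0)   # elementwise AND -> array([False,  True, False])
    # (a > 1.0) and (b > 3.0)      # raises ValueError: the truth value of an array
    #                              # with more than one element is ambiguous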
Re: [Numpy-discussion] Multiplying every 3 elements by a vector?
2008/7/9 Charles R Harris [EMAIL PROTECTED]: On Wed, Jul 9, 2008 at 2:34 PM, Marlin Rowley [EMAIL PROTECTED] wrote: Thanks Chuck, but I wasn't quite clear with my question. You answered exactly according to what I asked, but I failed to mention needing the dot product instead of just the product. So, v dot A = v', where v'[0] = v[0]*A[0] + v[1]*A[1] + v[2]*A[2], v'[1] = v[0]*A[3] + v[1]*A[4] + v[2]*A[5], v'[2] = v[0]*A[6] + v[1]*A[7] + v[2]*A[8]. There is no built-in method for this specific problem (stacks of vectors and matrices), but you can make things work: sum(A.reshape((-1,3))*v, axis=1) You can do lots of interesting things using such manipulations and newaxis. For instance, multiplying stacks of matrices by stacks of matrices, etc. I put up a post of such things once if you are interested. This particular instance can be viewed as a matrix multiplication (np.dot(A.reshape((-1,3)), v), I think). Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
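A quick check, with a made-up A and v, that the reshape-and-sum expression and the dot-product view agree:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    A = np.arange(12.0)                          # every 3 consecutive elements form one row

    v1 = np.sum(A.reshape((-1, 3)) * v, axis=1)  # multiply each row by v, then sum
    v2 = np.dot(A.reshape((-1, 3)), v)           # the same thing as a matrix-vector product
    print(np.allclose(v1, v2))                   # True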
[Numpy-discussion] expected a single-segment buffer object
Hi, When trying to construct an ndarray, I sometimes run into the more-or-less mystifying error "expected a single-segment buffer object":

In [55]: A = np.zeros(2); A = A[np.newaxis, ...]; np.ndarray(strides=A.strides, shape=A.shape, buffer=A, dtype=A.dtype)
---
<type 'exceptions.TypeError'>  Traceback (most recent call last)
/home/peridot/<ipython console> in <module>()
<type 'exceptions.TypeError'>: expected a single-segment buffer object

In [56]: A.strides
Out[56]: (0, 8)

That is, when I try to construct an ndarray based on an array with a zero stride, I get this mystifying error. Zero-strided arrays appear naturally when one uses newaxis, but they are valuable in their own right (for example for broadcasting purposes). So it's a bit awkward to have this error appearing when one tries to feed them to ndarray.__new__ as a buffer. I can, I think, work around it by removing all axes with stride zero:

def bufferize(A):
    # Index out every zero-stride axis (taking element 0) and keep the rest whole.
    idx = []
    for v in A.strides:
        if v == 0:
            idx.append(0)
        else:
            idx.append(slice(None, None, None))
    return A[tuple(idx)]

Is there any reason for this restriction? Thanks, Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] expected a single-segment buffer object
2008/7/9 Robert Kern [EMAIL PROTECTED]: Yes, the buffer interface, at least the subset that ndarray() consumes, requires that all of the data be contiguous in memory. array_as_buffer() checks for that using PyArray_ISONESEGMENT(), which looks like this:

#define PyArray_ISONESEGMENT(m) (PyArray_NDIM(m) == 0 || \
                                 PyArray_CHKFLAGS(m, NPY_CONTIGUOUS) || \
                                 PyArray_CHKFLAGS(m, NPY_FORTRAN))

Trying to get a buffer object from anything that is neither C- nor Fortran-contiguous will fail. E.g.

In [1]: from numpy import *
In [2]: A = arange(10)
In [3]: B = A[::2]
In [4]: ndarray(strides=B.strides, shape=B.shape, buffer=B, dtype=B.dtype)
---
TypeError  Traceback (most recent call last)
/Users/rkern/today/<ipython console> in <module>()
TypeError: expected a single-segment buffer object

Is this really necessary? What does making this restriction gain? It certainly means that many arrays whose storage is a contiguous block of memory still cannot be used (just permute the axes of a 3d array, say; it may even be possible for an array to be in C-contiguous order but for the flag not to be set), but how is one to construct exotic slices of an array that is strided in memory? (The real part of a complex array, say.) I suppose one could follow the linked list of .bases up to the original ndarray, which should normally be C- or Fortran-contiguous, then work out the offset, but even this may not always work: what if the original array was constructed with non-C-contiguous strides from some preexisting buffer? If the concern is that this allows users to shoot themselves in the foot, it's worth noting that even with the current setup you can easily fabricate strides and shapes that go outside the allocated part of memory. What is the use case, here? One rarely has to use the ndarray constructor by itself. For example, the result you seem to want from the call you make above can be done just fine with .view(). I was presenting a simple example. I was actually trying to use zero-strided arrays to implement broadcasting. The code was rather long, but essentially what it was meant to do was generate a view of an array in which an axis of length one had been replaced by an axis of length m with stride zero. (The point of all this was to create a class like vectorize that was suitable for use on, for example, np.linalg.inv().) But I also ran into this problem while writing segmentaxis.py, the code to produce a matrix of sliding windows. (See http://www.scipy.org/Cookbook/SegmentAxis .) There I caught the exception and copied the array (unnecessarily) if this came up. Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
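For what it's worth, the stride_tricks module mentioned earlier in the thread exposes as_strided, which sidesteps the buffer= constructor entirely; a small sketch of building a zero-strided "broadcast" view that way:

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    A = np.zeros(2)
    # Pretend A has a leading axis of length 5 with stride 0: the same two
    # values appear five times over without any copying.
    B = as_strided(A, shape=(5, 2), strides=(0, A.strides[0]))
    print(B.shape, B.strides)   # (5, 2) (0, 8) on a typical 64-bit build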
Re: [Numpy-discussion] chararray behavior
2008/7/8 Alan McIntyre [EMAIL PROTECTED]: On Tue, Jul 8, 2008 at 1:29 PM, Travis E. Oliphant [EMAIL PROTECTED] wrote: Alan McIntyre wrote: 2. The behavior of __mul__ seems odd: What is odd about this? It is patterned after 'a' * 3, 'a' * 4, 'a' * 5 for regular python strings. That's what I would have expected, but for N >= 4, Q*N is the same as Q*4. In particular, the returned type is always a string of length four, which is very peculiar - why four? I realize that variable-length strings are a problem (object arrays, I guess?), as is returning arrays of varying dtypes (strings of length N), but this definitely violates the principle of least surprise... Anne ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion