Re: [Numpy-discussion] 1.7.x release of NumPy
On 9/14/11 8:32 PM, Travis Oliphant wrote:
> but my mind is racing around NumPy 3.0 at this point.

Travis, could you clarify this point? Do you mean:

  There are so many ideas my mind is racing around too much to even
  think about 3.0?

or

  My mind is racing with ideas so much that I want to dive in and get
  started working on 3.0?

just wondering,

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959  voice
7600 Sand Point Way NE  (206) 526-6329  fax
Seattle, WA  98115      (206) 526-6317  main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] load from text files Pull Request Review
On 9/14/11 1:01 PM, Christopher Barker wrote:
> numpy.ndarray.resize is a different method, and I'm pretty sure it
> should be as fast or faster than np.empty + np.append.

My profile:

In [25]: %timeit f1  # numpy.resize()
1000 loops, best of 3: 163 ns per loop

In [26]: %timeit f2  # numpy.ndarray.resize()
1000 loops, best of 3: 136 ns per loop

In [27]: %timeit f3  # numpy.empty() + append()
1000 loops, best of 3: 136 ns per loop

those last two couldn't be more identical! (though this is an exercise
in unrequired optimization!)

My test code:

#!/usr/bin/env python

"""
test_resize

A test of various numpy re-sizing options
"""

import numpy

def f1():
    """numpy.resize"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        l += l
        a = numpy.resize(a, (l,))
    return None

def f2():
    """numpy.ndarray.resize"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        l += l
        a.resize(a, (l,))
    return None

def f3():
    """numpy.empty + append"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        b = np.empty((l,))
        a.append(b)
    return None

-Chris
Re: [Numpy-discussion] load from text files Pull Request Review
On 9/14/11 2:41 PM, Benjamin Root wrote:
> Are you sure the f2 code works? a.resize() takes only a shape tuple.
> As coded, you should get an exception.

wow, what an idiot!

I think I just timed how long it takes to raise that exception...

And when I fix that, I get a memory error. When I fix that, I find
that f3() wasn't doing the right thing. What an astonishing lack of
attention on my part!

Here it is again, working, I hope!

In [107]: %timeit f1()
10 loops, best of 3: 50.7 ms per loop

In [108]: %timeit f2()
1000 loops, best of 3: 719 us per loop

In [109]: %timeit f3()
100 loops, best of 3: 19 ms per loop

So:

numpy.resize() is the slowest
numpy.empty() + numpy.append() is faster
numpy.ndarray.resize() is the fastest

Which matches my expectations, for once!

-Chris

The code:

#!/usr/bin/env python

"""
test_resize

A test of various numpy re-sizing options
"""

import numpy

def f1():
    """numpy.resize"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        l += extra
        a = numpy.resize(a, (l,))
    return a

def f2():
    """numpy.ndarray.resize"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        l += extra
        a.resize((l,))
    return a

def f3():
    """numpy.empty + append"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        b = numpy.empty((extra,))
        a = numpy.append(a, b)
    return a

a1 = f1()
a2 = f2()
a3 = f3()

if a1.shape == a2.shape == a3.shape:
    print "they are all returning the same size array"
else:
    print "Something is wrong!"
Re: [Numpy-discussion] [ANN] Cython 0.15
On 8/8/11 1:21 AM, Sebastian Haase wrote:
> b) What is the status of supporting multi-type Cython functions --
> ala C++ templates ?

You might want to take a look at what Keith Goodman has done with the
Bottleneck project -- I think he used a generic template tool to
generate Cython code for a variety of types from a single definition.

-Chris
Re: [Numpy-discussion] Reading a big netcdf file
On 8/3/11 3:56 PM, Gökhan Sever wrote:
> Back to the reality. After clearing the cache using Warren's
> suggestion:
>
> In [1]: timeit -n1 -r1 a = np.fromfile('temp.npa', dtype=np.uint16)
> 1 loops, best of 1: 7.23 s per loop

yup -- that cache sure can be handy!

-Chris
Re: [Numpy-discussion] Reading a big netcdf file
On 8/4/11 10:02 AM, Christopher Barker wrote:
> On 8/4/11 8:53 AM, Jeff Whitaker wrote:
>> Kiko: I think the difference may be that when you read the data with
>> netcdf4-python, it tries to unpack the short integers to a float32
>> array.
>
> Jeff, why is that? is it a netcdf4 convention? I always thought that
> the netcdf data model matched numpy's quite well, including the clear
> choice and specification of data type.
>
> I guess I've mostly used float data anyway, so hadn't noticed this,
> but it comes as a surprise to me!

>> gebco4.set_automaskandscale(False)

OK -- looked at this a bit more, and see in the OP's ncdump:

variables:
        short z(xysize) ;
                z:scale_factor = 1. ;
                z:add_offset = 0. ;
                z:node_offset = 0 ;

so I presume netCDF4 is seeing the scale_factor and offsets, and thus
converting to float. In this case, the scale factor is 1.0 and the
offsets are 0.0, so there isn't any need to convert, but that may be
too smart!

-Chris
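For reference, a minimal sketch of turning that conversion off before
reading (the file name here is made up, and check your netcdf4-python
version for the exact method spelling -- current releases call it
set_auto_maskandscale):

from netCDF4 import Dataset

nc = Dataset('gebco_08.nc')        # hypothetical file name
z = nc.variables['z']
z.set_auto_maskandscale(False)     # keep the raw short integers
raw = z[:]                         # stays int16, no float32 unpacking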
Re: [Numpy-discussion] Finding many ways to incorrectly create a numpy array. Please advice
On 8/2/11 1:40 PM, Jeremy Conlin wrote:
> Thanks for that information. It helps greatly in understanding what
> is happening. Next time I'll put my data into tuples.

I don't remember where they all are, but there are a few places in
numpy where tuples and lists are interpreted differently (fancy
indexing?). It kind of breaks python duck typing (a sequence is a
sequence), but it's useful, too.

So when a list fails to do what you want, try a tuple.

-Chris
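Indexing is the clearest example of the tuple/list distinction I know
of -- a quick sketch:

import numpy as np

a = np.arange(12).reshape(3, 4)

print a[(1, 2)]   # tuple: a multi-dimensional index, same as a[1, 2] -> 6
print a[[1, 2]]   # list: fancy indexing, selects rows 1 and 2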
Re: [Numpy-discussion] Reading a big netcdf file
On 8/3/11 9:30 AM, Kiko wrote:
> I'm trying to read a big netcdf file (445 Mb) using netcdf4-python.

I've never noticed that netCDF4 was particularly slow for reading
(writing can be pretty slow some times). How slow is slow?

> The data are described as:

please post the results of:

ncdump -h the_file_name.nc

So we can see if there is anything odd in the structure (though I
don't know what it might be)

Post your code (in the simplest form you can), and post your timings
and machine type.

Is the file netcdf4 or 3 format? (the python lib will read either)

As a reference, reading that much data in from a raw file into a numpy
array takes 2.57 on my machine (a rather old Mac, but disks haven't
gotten much faster). You can test that like this:

a = np.zeros((21601, 10801), dtype=np.uint16)
a.tofile('temp.npa')
del a
timeit a = np.fromfile('temp.npa', dtype=np.uint16)

(using ipython's timeit)

-Chris
Re: [Numpy-discussion] Reading a big netcdf file
On 8/3/11 11:09 AM, Ian Stokes-Rees wrote:
> On 8/3/11 12:50 PM, Christopher Barker wrote:
>> As a reference, reading that much data in from a raw file into a
>> numpy array takes 2.57 on my machine (a rather old Mac, but disks
>> haven't gotten much faster).
>
> 2.57 seconds? or minutes?

sorry -- seconds.

> If seconds, does it actually read the whole thing into memory in that
> time, or is there some kind of delayed read going on?

I think it reads it all in. However, now that you bring it up, I think
timeit does it a few times, and after the first time, there may well
be disk cache that speeds things up. In fact, as I recently wrote the
file, there may be disk cache issues even on the first read.

I'm no timing expert, but there must be ways to get a clean time.

-Chris
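One way to get a cleaner cold-cache number is to flush the OS disk
cache and then time a single read yourself, rather than letting timeit
loop -- a sketch (the cache-flush commands are OS-specific and need
admin rights):

import time
import numpy as np

# first, flush the disk cache from a shell:
#   Linux: sync; echo 3 > /proc/sys/vm/drop_caches
#   OS X:  purge

start = time.time()
a = np.fromfile('temp.npa', dtype=np.uint16)
print "cold read took", time.time() - start, "seconds"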
Re: [Numpy-discussion] Reading a big netcdf file
On 8/3/11 9:46 AM, Gökhan Sever wrote:
> In [23]: from netCDF4 import Dataset
>
> In [24]: f = Dataset("test.nc")
>
> In [25]: f.variables['reflectivity'].shape
> Out[25]: (6, 18909, 506)
>
> In [26]: f.variables['reflectivity'].size
> Out[26]: 57407724
>
> In [27]: f.variables['reflectivity'][:].dtype
> Out[27]: dtype('float32')
>
> In [28]: timeit z = f.variables['reflectivity'][:]
> 1 loops, best of 3: 731 ms per loop

that seems pretty fast, actually -- are you sure that [:] forces the
full data read? It probably does, but I'm not totally sure.

is z a numpy array object at that point?

-Chris
Re: [Numpy-discussion] Reading a big netcdf file
On 8/3/11 1:57 PM, Gökhan Sever wrote:
> This is what I get here:
>
> In [1]: a = np.zeros((21601, 10801), dtype=np.uint16)
>
> In [2]: a.tofile('temp.npa')
>
> In [3]: del a
>
> In [4]: timeit a = np.fromfile('temp.npa', dtype=np.uint16)
> 1 loops, best of 3: 251 ms per loop

so that's about 10 times faster than my machine. I didn't think disks
had gotten much faster -- they are still generally 7200 rpm (or slower
in laptops). So I've either got a really slow disk, or you have a
really fast one (or both), or maybe you're getting cache effects, as
you wrote the file just before reading it.

repeating, doing just what you did:

In [8]: timeit a = np.fromfile('temp.npa', dtype=np.uint16)
1 loops, best of 3: 2.53 s per loop

then I wrote a bunch of others to disk, and tried again:

In [17]: timeit a = np.fromfile('temp.npa', dtype=np.uint16)
1 loops, best of 3: 2.45 s per loop

so it seems I'm not seeing cache effects, but maybe you are.

Anyway, we haven't heard from the OP -- I'm not sure what s/he thought
was slow.

-Chris
Re: [Numpy-discussion] Finding many ways to incorrectly create a numpy array. Please advice
On 8/2/11 8:38 AM, Jeremy Conlin wrote:
> Thanks, Brett. Using StringIO and numpy.loadtxt worked great. I'm
> still curious why what I was doing didn't work. Everything I can see
> indicates it should work.

> In [11]: tfc_dtype
> Out[11]: dtype([('nps', 'u8'), ('t', 'f8'), ('e', 'f8'), ('fom', 'f8')])
>
> In [15]: n = numpy.fromstring(l, dtype=tfc_dtype, sep=' ')
> ---------------------------------------------------------------------------
> ValueError                               Traceback (most recent call last)
> /Users/cbarker/<ipython console> in <module>()
> ValueError: don't know how to read character strings with that array type

means just what it says. In theory, numpy.fromstring() (and
fromfile()) provides a way to quickly and efficiently generate arrays
from text, but in practice, the code is quite limited (and has a bug
or two). I don't think anyone has gotten around to writing the code to
use structured dtypes with it -- so it can't do what you want
(rational though that expectation is).

> In [21]: words
> Out[21]: ['32000', '7.89131E-01', '8.05999E-03', '3.88222E+03']
>
> In [22]: p = numpy.array(words, dtype=tfc_dtype)
>
> In [23]: p
> Out[23]:
> array([(3689064028291727360L, 0.0, 0.0, 0.0),
>        (3976177339304456517L, 4.967820413490985e-91, 0.0, 0.0),
>        (4048226120204106053L, 4.970217431784588e-91, 0.0, 0.0),
>        (3687946958874489413L, 1.1572189237420885e-100, 0.0, 0.0)],
>       dtype=[('nps', 'u8'), ('t', 'f8'), ('e', 'f8'), ('fom', 'f8')])

similarly here -- converting from text to structured dtypes is not
fully supported.

> In [29]: a
> Out[29]: [32000, 0.789131, 0.00805999, 3882.22]
>
> In [30]: r = numpy.array(a)
>
> In [31]: r
> Out[31]:
> array([ 3.20000000e+04,  7.89131000e-01,  8.05999000e-03,
>         3.88222000e+03])

sure -- numpy's default behavior is to find a dtype that will hold all
of the input array -- this pre-dates structured dtypes, and is
probably what you would want by default anyway.

> In [32]: s = numpy.array(a, dtype=tfc_dtype)
> ---------------------------------------------------------------------------
> TypeError                                Traceback (most recent call last)
> /Users/cbarker/<ipython console> in <module>()
> TypeError: expected a readable buffer object

OK -- I can see why you'd expect that to work. However, the trick with
structured dtypes is that the dimensionality of the inputs can be less
than obvious -- you are passing in a 1-d list of 4 numbers -- do you
want a 1-d array? or? -- in this case, it's pretty obvious (as a
human) what you would want -- you have a dtype with four fields, and
you're passing in four numbers, but there are so many possible
combinations that numpy doesn't try to be smart about it. So as a
rule, you need to be quite specific when working with structured
dtypes.

However, the default is for numpy to map tuples to dtypes, so if you
pass in a tuple instead, it works:

In [34]: t = tuple(a)

In [35]: s = numpy.array(t, dtype=tfc_dtype)

In [36]: s
Out[36]:
array((32000L, 0.789131, 0.00805999, 3882.22),
      dtype=[('nps', 'u8'), ('t', 'f8'), ('e', 'f8'), ('fom', 'f8')])

you were THIS close!

-Chris
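To pull it together, a sketch of the two constructions that do work
for structured dtypes (the second record is made up for illustration):

import numpy

tfc_dtype = numpy.dtype([('nps', 'u8'), ('t', 'f8'),
                         ('e', 'f8'), ('fom', 'f8')])

# one record: a tuple
s = numpy.array((32000, 0.789131, 0.00805999, 3882.22), dtype=tfc_dtype)

# several records: a list OF tuples
recs = numpy.array([(32000, 0.789131, 0.00805999, 3882.22),
                    (64000, 0.812345, 0.00912340, 4123.45)],
                   dtype=tfc_dtype)

print recs['nps']   # field access -> [32000 64000]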
Re: [Numpy-discussion] Segmentation Fault in Numpy.test()
On 8/2/11 10:14 AM, Thomas Markovich wrote:
> Just to recap, the issue appears to stem from using the scipy
> superpack with python 2.7 from python.org. This was solved by using
> the apple python along with the scipy superpack.

This sure sounds like a bug in the scipy superpack installer -- if it
was built for the system python2.6, it should not get installed into
2.7. Unless you did something to force that.

-Chris
Re: [Numpy-discussion] recommendation for saving data
On 7/31/11 5:48 AM, Brian Blais wrote:
> I was wondering if there are any recommendations for formats for
> saving scientific data.

every field has its own standards -- I'd try to find one that is
likely to be used by folks that may care about your results.

For Oceanographic and Atmospheric modeling data, netcdf is a good
option. I like the netCDF4 python lib:

http://code.google.com/p/netcdf4-python/

(there are others)

For broader use, and a bit more flexibility, HDF is a good option.
There are at least two ways to use it with numpy:

PyTables: http://www.pytables.org
(nice higher-level interface)

h5py: http://alfven.org/wp/hdf5-for-python/
(a more raw HDF5 wrapper)

There is also the npz format, built in to numpy, if you are happy with
requiring python to read the data.

-Chris

> I am running a simulation, which has many somewhat-independent parts
> which have their own internal state and parameters. I've been using
> pickle (gzipped) to save the entire object (which contains
> subobjects, etc...), but it is getting too unwieldy and I think it is
> time to look for a more robust solution. Ideally I'd like to have
> something where I can call a save method on the simulation object,
> and it will call the save methods on all the children, on down the
> line, all saving into one file. It'd also be nice if it were
> cross-platform, and I could depend on the files being readable into
> the future for a while.
>
> Are there any good standards for this? What do you use for saving
> scientific data?
>
> thank you,
>
> Brian Blais
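The npz option is a one-liner each way -- a minimal sketch (file and
array names made up):

import numpy as np

state = np.random.rand(100, 3)
params = np.array([1.5, 0.1, 42.0])

np.savez('simulation.npz', state=state, params=params)

data = np.load('simulation.npz')
print data['state'].shape, data['params']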
Re: [Numpy-discussion] Problems with numpy binary for Python2.7 + OS X 10.6
On 7/25/11 1:00 PM, Ian Stokes-Rees wrote:
> As best I can tell, I have Python 2.7.2 for my system Python:
>
> [ijstokes@moose ~]$ python -V
> Python 2.7.2
>
> [ijstokes@moose ~]$ which python
> /Library/Frameworks/Python.framework/Versions/2.7/bin/python

yup -- that is probably the python.org python. However, there are now
two different builds of 2.7 for OS-X: the 10.3 one, which is 32 bit,
PPC+Intel, for 10.3.9 and above, and the 10.6 build, which is
32bit+64bit, Intel only, and only runs on 10.6 and above.

> however when I attempt to install the recent numpy binary
> python-2.7.2-macosx10.6.dmg I get stopped at the first stage of the
> install procedure with the error:
>
> numpy 1.6.1 can't be installed on this disk. numpy requires System
> Python 2.7 to install.

You need a numpy build that matches the python build -- I suspect you
have a mis-match. Check carefully which ones you downloaded and make
sure they match.

> Is it looking for /usr/bin/python2.7? For that, I only have up to 2.6
> available. (and 2.5)

No. The "System Python" term is a misnomer that somehow has never
gotten fixed in the installer -- what it is really looking for is the
python.org build.

HTH,
-Chris

> Cheers,
> Ian
Re: [Numpy-discussion] Array vectorization in numpy
Carlos Becker wrote:
> Besides the matlab/numpy comparison, I think that there is an
> inherent problem with how expressions are handled, in terms of
> efficiency. For instance, k = (m - 0.5)*0.3 takes 52 msec average
> here (2000x2000 array), while k = (m - 0.5)*0.3*0.2 takes 0.079, and
> k = (m - 0.5)*0.3*0.2*0.1 takes 101 msec. Placing parentheses around
> the scalar multipliers shows that it seems to have to do with how
> expressions are handled. Is there something that can be done about
> this so that numpy can deal with expressions rather than single
> operations chained by python itself?

well, it is Python, and Python itself does not know anything about
array math -- so you need to be careful to do that correctly yourself.

Python aside, understanding how parentheses affect computation, even
for algebraically equal expressions, is a good thing to understand.

Aside from issues with scalars, a Python expression like:

a = a * b * c

does:

  multiply a and b and put it in a temporary
  multiply that temporary by c and put that in a temporary
  assign the final temporary to a

so it does, in fact, create two unnecessary temporaries for this
simple expression.

If you are concerned about performance, there are ways to control
that. In-place operators are one:

a *= b
a *= c

will not create any temporaries, and will probably be faster. It still
does two loops through the data, though. If your arrays are too big to
fit in cache, that could affect performance. To get around that you
need to get fancy.

One option is numexpr:

http://code.google.com/p/numexpr/

numexpr takes an entire expression as input, and can thus optimize
some of this at the expression level, make good use of cache, etc. --
it's pretty cool.

There are a few other options, including weave, of course.

-Chris
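As a sketch of the three levels (numexpr is a separate install, and
the timings will vary by machine):

import numpy as np
import numexpr as ne

m = np.random.rand(2000, 2000)

# plain expression: the (m - 0.5) result is a throw-away temporary
k1 = (m - 0.5) * 0.3

# in-place: no temporaries, but still two passes over the data
k2 = m.copy()
k2 -= 0.5
k2 *= 0.3

# numexpr: compiles the whole expression, one cache-friendly pass
k3 = ne.evaluate("(m - 0.5) * 0.3")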
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote:
> Here's a short-ish summary of the topics discussed in the conference
> call this afternoon.

Thanks, this is great! And thanks to all who participated in the call.

> 3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE],
> [IGNORE, 3, 4] ] should behave exactly the same as [ [1, 2], [3, 4] ].

whoooa! I actually have been looking for, and thinking about, jagged
arrays a fair bit lately, so this is kind of exciting, but this looks
like a bad idea to me.

The above indicates that:

a = np.array( [ [1, 2, np.IGNORE], [np.IGNORE, 3, 4] ] )

a[:,1] would yield:

array([2, 4])

which seems really wrong -- you've tossed out the location information
altogether. (I think it should be: array([2, 3]))

I could see a jagged array being represented by IGNOREs all at the END
of each row, but putting items in the middle, and shifting things to
the left, strikes me as a plain old bad idea (and a pain to implement).

-Chris
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Dag Sverre Seljebotn wrote:
> Here's an HPC perspective...:
>
> At least I feel that the transparency of NumPy is a huge part of its
> current success. Many more than me spend half their time in C/Fortran
> and half their time in Python.

Absolutely -- and this point has been raised a couple times in the
discussion, so I hope it is not forgotten.

> I tend to look at NumPy this way: Assuming you have some data in
> memory (possibly loaded by a C or Fortran library). (Almost) no
> matter how it is allocated, ordered, packed, aligned -- there's a way
> to find strides and dtypes to put a nice NumPy wrapper around it and
> use the memory from Python.

and vice-versa -- assuming you have some data in numpy arrays, there's
a way to process it with a C or Fortran library without copying the
data.

And this is where I am skeptical of the bit-pattern idea -- while one
can expect C and Fortran and GPU, and ??? to understand NaNs for
floating point data, is there any support in compilers or hardware for
special bit patterns for NA values in integers? I've never seen it, in
my (very limited) experience.

Maybe having the mask option, too, will make that irrelevant, but I
want to be clear about that kind of use case.

-Chris
Re: [Numpy-discussion] using the same vocabulary for missing value ideas
Mark Wiebe wrote:
> 1) NA vs IGNORE and bitpattern vs mask are completely independent.
> Any combination of NA as bitpattern, NA as mask, IGNORE as
> bitpattern, and IGNORE as mask are reasonable.

Is this really true? if you use a bitpattern for IGNORE, haven't you
just lost the ability to get the original value back if you want to
stop ignoring it? Maybe that's not inherent to what an IGNORE means,
but it seems pretty key to me.

-Chris
Re: [Numpy-discussion] NA/Missing Data Conference Call Summary
Christopher Jordan-Squire wrote:
> If we follow those rules for IGNORE for all computations, we
> sometimes get some weird output. For example:
>
> [ [1, 2], [3, 4] ] * [ IGNORE, 7 ] = [ 15, 31 ]
>
> (Where * is matrix multiply and not * with broadcasting.) Or should
> that sort of operation throw an error?

That should throw an error -- matrix computation is heavily influenced
by the shape and size of matrices, so I think IGNOREs really don't
make sense there.

Nathaniel Smith wrote:
> It's exactly this transparency that worries Matthew and me -- we feel
> that the alterNEP preserves it, and the NEP attempts to erase it. In
> the NEP, there are two totally different underlying data structures,
> but this difference is blurred at the Python level. The idea is that
> you shouldn't have to think about which you have, but if you work
> with C/Fortran, then of course you do have to be constantly aware of
> the underlying implementation anyway.

I don't think this bothers me -- I think it's analogous to things in
numpy like Fortran order and non-contiguous arrays -- you can ignore
all that when working in pure python, when performance isn't critical,
but you need a deeper understanding if you want to work with the data
in C or Fortran or to tune performance in python.

So as long as there is an API to query and control how things work, I
like that it's hidden from simple python code.

-Chris
Re: [Numpy-discussion] Missing/accumulating data
Mark Wiebe wrote:
>> Speaking of which, would we make the NA value be false?
>
> For booleans, it works out like this:
>
> http://en.wikipedia.org/wiki/Ternary_logic#Kleene_logic

That's pretty cool!

> In R, trying to test the truth value of NA (if (NA) ...) raises an
> exception. Adopting this behavior seems reasonable to me.

I'm not so sure. The other precedent is Python, where None is
interpreted as False. In general, in non-numpy code, I use None to
mean "not set yet", or "I'm not sure", or whatever. It's pretty useful
to have it be false.

However, I also do:

if x is not None:

rather than:

if x:

so as to be unambiguous about what I'm testing for (and because if
x == 0, I don't want the test to fail), so I guess:

if arr[i] is np.NA:

would be perfectly analogous.

-Chris
Re: [Numpy-discussion] suggestions on optimising a special matrix reshape
qu...@gmx.at wrote:
> i have to reshape a matrix beta of the form (4**N, 4**N, 4**N, 4**N)
> into betam like (16**N, 16**N) following:
>
> betam = np.zeros((16**N, 16**N), dtype=complex)
> for k in xrange(16**N):
>     ind1 = np.mod(k, 4**N)
>     ind2 = k / 4**N
>     for l in xrange(16**N):
>         betam[k, l] = beta[np.mod(l, 4**N), l / 4**N, ind1, ind2]
>
> is there a smarter/faster way of getting the above done?

no time to check if this is what you want, but is this it?

a = np.arange(4**(4*N)).reshape(4**N, 4**N, 4**N, 4**N)
b = a.reshape((16**N, 16**N))

If that doesn't do it right, you may be able to mess with the strides,
etc. Do some googling, and check out:

numpy.lib.stride_tricks

-Chris

> for N=2, that already takes 0.5 seconds, but i intend to use it for
> N=3 and N=4 ...
>
> thanks for your input,
> q
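For what it's worth, a plain reshape does not match the loop above:
the loop puts the last two indices of beta into the row index (in
reversed order), so a transpose is needed first. A sketch, assuming
I've read the index mapping right -- check it against the loop for
small N before trusting it:

import numpy as np

def reshape_beta(beta, N):
    # want betam[k, l] = beta[l % M, l // M, k % M, k // M], M = 4**N;
    # reversing the axes turns that into a plain C-order reshape
    M = 4**N
    return beta.transpose(3, 2, 1, 0).reshape(M * M, M * M)

# verify against the explicit double loop for N = 1
N = 1
M = 4**N
beta = np.arange(float(M**4)).reshape(M, M, M, M)
betam = np.zeros((M * M, M * M))
for k in xrange(M * M):
    for l in xrange(M * M):
        betam[k, l] = beta[l % M, l // M, k % M, k // M]

assert (reshape_beta(beta, N) == betam).all()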
Re: [Numpy-discussion] alterNEP - was: missing data discussion round 2
Matthew Brett wrote:
> should raise an error. On the other hand, if I make a normal array:
>
> arr = np.array([1.0, 2.0, 7.0])
>
> and then do this:
>
> arr.visible[2] = False
>
> then either I should raise an error (it's not a masked array), or,
> more magically, construct a mask on the fly.

maybe it's too much Magic, but it seems reasonable to me that for an
array without a mask, arr.visible[i] is simply True for all values of
i -- no need to create a mask to determine that.

does:

arr[i] = np.IGNORE

auto-create a mask if there is not one there already? I think it
should.

-Chris
Re: [Numpy-discussion] Missing/accumulating data
Joe Harrington wrote:
> All that has to happen is to allow the sense of the mask to be FALSE
> = the data are bad, TRUE = the data are good, and allow (not require)
> the mask to be of any numerical type, or at least of integer type as
> well as boolean.

quick note on this: I like the FALSE == good way, because instead of
"good" and "bad" we think "masked" and "unmasked"; then we have:

False = unmasked = regular old data
True  = masked   = something special about the data

The default for "something special" is "bad" (or "missing", or
"ignore"), but the cool thing is that if you use an int:

0 = unmasked
1 = masked because of one thing
2 = masked because of another

etc., etc. This could be pretty powerful.

-Chris
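You can get a taste of that today by keeping an integer flag array
next to the data, and collapsing it to a boolean for numpy.ma (the
flag codes here are made up):

import numpy as np
import numpy.ma as ma

data = np.array([1.0, 2.0, 3.0, 4.0])
flags = np.array([0, 1, 0, 2])  # 0 = good, 1 = sensor dropout, 2 = out of range

masked = ma.masked_array(data, mask=(flags != 0))
print masked.mean()        # computed from the good values only
print data[flags == 2]     # and the *reason* is still queryable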
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Nathaniel Smith wrote:
> The 'dtype factory' idea builds on the way I've structured datetime
> as a parameterized type, ...

Another disadvantage is that we get further from Gael Varoquaux's
point:

> Right now, the numpy array can be seen as an extension of the C
> array, basically a pointer, a data type, and a shape (and strides).
> This enables easy sharing with libraries that have not been written
> with numpy in mind.

and also PEP 3118 support.

It is very useful that a numpy array has a pointer to a regular old C
array -- if we introduce this special dtype, that will break (well,
not really, but the C array would be of this particular struct).
Granted, any other C code would properly have to do something with the
mask anyway, but I still think it'd be better to keep that raw data
array standard.

This applies to switching between masked and not-masked numpy arrays
also -- I don't think I'd want the performance hit of that requiring a
data copy. Also, the idea was posted here that you could use views to
have the same data set with different masks -- that would break as
well.

Nathaniel Smith wrote:
> If we think that the memory overhead for floating point types is too
> high, it would be easy to add a special case where maybe(float) used
> a distinguished NaN instead of a separate boolean.

That would be pretty cool, though in the past folks have made a good
argument that even for floats, masks have significant advantages over
just using NaN. One might be that you can mask and unmask a value for
different operations, without losing the value.

-Chris
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Robert Kern wrote:
> It's worth noting that this is not a replacement for masked arrays,
> nor is it intended to be the be-all, end-all solution to missing data
> problems. It's mostly just intended to be a focused tool to fill in
> the gaps where masked arrays are less convenient for whatever reason;

I think we have enough problems with ndarray vs numpy.ma -- this is
introducing a third option? IMHO, and from the discussion, it seems
this proposal should be a uniter, not a divider. While masked arrays
may still exist, it would be nice if they were an extension of the new
built-in thingie, not yet another implementation.

> Not every dtype would have an NA-aware counterpart.

One of the great things about numpy is the full range of data types. I
think it would be surprising and frustrating not to have masked
versions of them all.

By the way, what might be the performance hit of a new dtype --
wouldn't we lose all sorts of opportunities for the compiler and
hardware to optimize? I can only imagine an if statement with every
single computation. But maybe that isn't any more of a hit than a
separate mask.

-Chris
Re: [Numpy-discussion] feedback request: proposal to add masks to the core ndarray
Nathaniel Smith wrote:
> IME it's important for the default behavior to be to fail loudly when
> things are wonky, not to silently patch them up, possibly
> incorrectly!

+1 for the default behavior being to fail, raise errors, or return
"unknown" values in these situations. Though that behavior should be
over-ridable with a flag or additional functions (like nansum).

Robert Kern wrote:
> How your proposal improves on the status quo, i.e. numpy.ma.

I've thought for years that masked arrays should be better integrated
with numpy -- it seems there is a lot of code duplication and upkeep
required to keep masked arrays working as much like regular arrays as
possible. Particularly having third-party code work with them.

And I'm not sure I can explain why, but I find myself hardly ever
using masked arrays -- it just feels like added complication to use
them -- I tend to work with floats and use NaN instead, even though
it's often not as good an option.

Gael Varoquaux wrote:
> Right now, the numpy array can be seen as an extension of the C
> array, basically a pointer, a data type, and a shape (and strides).
> This enables easy sharing with libraries that have not been written
> with numpy in mind.

Isn't that what the PEP 3118 extended buffer protocol is supposed to
be for?

Anyway, good stuff!

-Chris
Re: [Numpy-discussion] fast numpy i/o
Neal Becker wrote:
> I'm wondering what are good choices for fast numpy array
> serialization?
>
> mmap: fast, but I guess not self-describing?
> hdf5: ?
> pickle: self-describing, but maybe not fast?
> others?

hdf5: should be pretty fast, and self-describing -- with the advantage
of being a standard. The disadvantage is that it requires an hdf5
library, which can be a pain to install on some systems.

There is .tofile() and .fromfile() -- should be about as fast as you
can get, but not self-describing.

Then there is .save(), savez() and .load() (.npy/.npz format). It
should be pretty fast, and self-describing (but not a standard outside
of numpy).

I doubt pickle will ever be your best bet.

-Chris
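A quick sketch of the self-describing difference (file names made up):

import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)

np.save('a.npy', a)    # dtype and shape stored in the header
b = np.load('a.npy')   # comes back as float32, shape (3, 4)

a.tofile('a.raw')      # just the raw bytes
c = np.fromfile('a.raw', dtype=np.float32).reshape(3, 4)
                       # caller must already know the dtype and shape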
Re: [Numpy-discussion] fast numpy i/o
Neal Becker wrote:
> I'm wondering what are good choices for fast numpy array
> serialization?
>
> mmap: fast, but I guess not self-describing?
> hdf5: ?
> pickle: self-describing, but maybe not fast?
> others?
>
> I think, in addition, that hdf5 is the only one that easily
> interoperates with matlab?

netcdf is another option if you want an open standard supported by
Matlab and the like. I like the netCDF4 package:

http://code.google.com/p/netcdf4-python/

Though it can be kind of slow to write (I have no idea why!)

plain old tofile() should be readable by other tools as well, as long
as you have a way to specify what is in the file.

> speaking of hdf5, I see:
>
> pyhdf5io 0.7 - Python module containing high-level hdf5 load and save
> functions.
> h5py 2.0.0 - Read and write HDF5 files from Python

There is also pytables, which uses HDF5 under the hood.

-Chris
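For reference, a minimal netCDF4-python round trip looks something
like this (file, dimension, and variable names made up):

import numpy as np
from netCDF4 import Dataset

a = np.random.rand(100, 50)

nc = Dataset('data.nc', 'w')
nc.createDimension('x', a.shape[0])
nc.createDimension('y', a.shape[1])
v = nc.createVariable('a', 'f8', ('x', 'y'))
v[:] = a
nc.close()

b = Dataset('data.nc').variables['a'][:]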
Re: [Numpy-discussion] fast numpy i/o
Robert Kern wrote:
> https://raw.github.com/numpy/numpy/master/doc/neps/npy-format.txt

Just a note. From that doc:

> HDF5 is a complicated format that more or less implements a
> hierarchical filesystem-in-a-file. This fact makes satisfying some of
> the Requirements difficult. To the author's knowledge, as of this
> writing, there is no application or library that reads or writes even
> a subset of HDF5 files that does not use the canonical libhdf5
> implementation.

I'm pretty sure that the NetcdfJava libs, developed by Unidata, use
their own home-grown code. netcdf4 is built on HDF5, so that qualifies
as a library that reads or writes a subset of HDF5 files. Perhaps
there are lessons to be learned there. (too bad it's Java)

> Furthermore, by providing the first non-libhdf5 implementation of
> HDF5, we would be able to encourage more adoption of simple HDF5 in
> applications where it was previously infeasible because of the size
> of the library.

I suppose this point is still true -- a C lib that supported a subset
of hdf would be nice.

That being said, I like the simplicity of the .npy format, and I don't
know that anyone wants to take any of this on anyway.

-Chris
Re: [Numpy-discussion] fast grayscale conversion
Alex Flint wrote:
> Thanks, that's helpful. I'm now getting comparable times on a
> different machine. It must be something else slowing down my machine
> more generally, not just numpy.

you also might want to get a bit fancier than simply scaling linearly
-- R, G, and B don't necessarily all contribute equally to our sense
of "whiteness". For instance, PIL uses:

> When translating a colour image to black and white, the library uses
> the ITU-R 601-2 luma transform:
>
> L = R * 299/1000 + G * 587/1000 + B * 114/1000

which would be easy enough to do with numpy.

-Chris

On Mon, Jun 20, 2011 at 5:11 PM, Eric Firing <efir...@hawaii.edu> wrote:
> On 06/20/2011 10:41 AM, Zachary Pincus wrote:
>> You could try:
>>
>> src_mono = src_rgb.astype(float).sum(axis=-1) / 3.
>>
>> But that speed does seem slow. Here are the relevant timings on my
>> machine (a recent MacBook Pro) for a 3.1-megapixel-size array:
>>
>> In [16]: a = numpy.empty((2048, 1536, 3), dtype=numpy.uint8)
>>
>> In [17]: timeit numpy.dot(a.astype(float), numpy.ones(3)/3.)
>> 10 loops, best of 3: 116 ms per loop
>>
>> In [18]: timeit a.astype(float).sum(axis=-1)/3.
>> 10 loops, best of 3: 85.3 ms per loop
>>
>> In [19]: timeit a.astype(float)
>> 10 loops, best of 3: 23.3 ms per loop
>
> On my slower machine (older laptop, core2 duo), you can speed it up
> more:
>
> In [3]: timeit a.astype(float).sum(axis=-1)/3.0
> 1 loops, best of 3: 235 ms per loop
>
> In [5]: timeit b = a.astype(float).sum(axis=-1); b /= 3.0
> 1 loops, best of 3: 181 ms per loop
>
> In [7]: timeit b = a.astype(np.float32).sum(axis=-1); b /= 3.0
> 10 loops, best of 3: 148 ms per loop
>
> If you really want float64, it is still faster to do the first
> operation with single precision:
>
> In [8]: timeit b = a.astype(np.float32).sum(axis=-1).astype(np.float64); b /= 3.0
> 10 loops, best of 3: 163 ms per loop
>
> Eric
>
>> On Jun 20, 2011, at 4:15 PM, Alex Flint wrote:
>>> At the moment I'm using numpy.dot to convert a WxHx3 RGB image to a
>>> grayscale image:
>>>
>>> src_mono = np.dot(src_rgb.astype(np.float), np.ones(3)/3.)
>>>
>>> This seems quite slow though (several seconds for a 3 megapixel
>>> image) - is there a more specialized routine better suited to this?
>>>
>>> Cheers,
>>> Alex
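In numpy, that weighted version is the same dot product with different
weights -- a sketch:

import numpy as np

rgb = np.empty((2048, 1536, 3), dtype=np.uint8)   # stand-in image

# ITU-R 601-2 luma weights instead of the flat 1/3, 1/3, 1/3
weights = np.array([0.299, 0.587, 0.114])
gray = np.dot(rgb.astype(np.float32), weights)    # shape (2048, 1536)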
[Numpy-discussion] get a dtypes' range?
Hi all,

I'm wondering if there is a way to get the range of values a given
dtype can hold?

essentially, I'm looking for something like sys.maxint, for whatever
numpy dtype I have on hand (and eps for floating point types).

I was hoping there would be something like:

a_dtype.max_value
a_dtype.min_value

etc.

In this case, I have a uint16, so I can hard code it, but it would be
nice to be able to write the code in a more generic fashion.

-Chris
Re: [Numpy-discussion] get a dtypes' range?
Fabrice Silva wrote:
>> I'm wondering if there is a way to get the range of values a given
>> dtype can hold?
>
> http://docs.scipy.org/doc/numpy/reference/routines.dtype.html#data-type-information

yup, that does it. I hadn't found that, as I was expecting a more OO
way, i.e. as a method of the dtype. Also, I need to know in advance if
it's an int or a float, but I suppose you can't really do much without
knowing that anyway.

Actually, I'm a bit confused about dtypes from an OO design
perspective anyway. I note that the dtypes seem to have all (most?) of
the methods of ndarrays (or placeholders, anyway), which I don't quite
get.

oh well, I've found what I need, thanks.

-Chris
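For the record, the functions behind that link are np.iinfo and
np.finfo:

import numpy as np

print np.iinfo(np.uint16).min, np.iinfo(np.uint16).max  # 0 65535
print np.finfo(np.float64).eps   # machine epsilon, ~2.22e-16
print np.finfo(np.float64).max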
Re: [Numpy-discussion] get a dtypes' range?
Robert Kern wrote:
> On Fri, Jun 17, 2011 at 13:27, Christopher Barker
> <chris.bar...@noaa.gov> wrote:
>> Actually, I'm a bit confused about dtypes from an OO design
>> perspective anyway. I note that the dtypes seem to have all (most?)
>> of the methods of ndarrays (or placeholders, anyway), which I don't
>> quite get.
>
> No, they don't.
>
> |33> d = np.dtype(float)
> |34> dir(d)
> ...
>
> The numpy *scalar* types do, ...

OK -- that is, indeed, what I observed, from:

dir(np.uint16)

> ... because they are actual Python types just like any other. They
> provide the *unbound* methods (i.e. you cannot call them) that will
> be bound when an instance of it gets created. And the scalar types
> have many of the same methods as ndarray to allow some ability to
> write generic functions that will work with either arrays or scalars.

That's handy. I guess I'm confused about the difference between:

np.dtype(uint16)

and

np.uint16

since I always pass in the latter when a dtype is called for.

-Chris
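The relationship, as a quick sketch:

import numpy as np

d = np.dtype(np.uint16)   # a dtype instance *describing* the type
t = np.uint16             # the scalar type itself

print d.type is t         # True -- the dtype carries the scalar type
print np.dtype(t) == d    # True -- scalar types are accepted anywhere
                          # a dtype is called for, converted on the way in
print t(70000)            # scalar types are callable: 70000 % 65536 = 4464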
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
NOTE: I'm only taking part in this discussion because it's interesting
and I hope to learn something. I do hope the OP chimes back in to
clarify his needs, but in the meantime...

Bruce Southey wrote:
> Remember that is what the OP wanted to do, not me.

Actually, I don't think that's what the OP wanted -- I think we have a
conflict between the need for concrete examples, and the desire to
find a generic solution, so I think this is what the OP wants:

How to best multiprocess a _generic_ operation that needs to be
performed on a lot of arrays. Something like:

output = []
for a in a_bunch_of_arrays:
    output.append(a_function(a))

More specifically, a_function() is an inner product, *defined by the
user*. So there is no way to optimize the inner product itself (that
will be up to the user), nor any way to generally convert the
bunch_of_arrays to a single array with a single higher-dimensional
operation.

In testing his approach, the OP used a numpy multiply, and a simple,
loop-through-the-elements multiply, and found that with his
multiprocessing calls, the simple loop was a fair bit faster with two
processors, but that the numpy one was slower with two processors. Of
course, the looping method was much, much slower than the numpy one in
any case.

So Sturla's comments are probably right on:

Sturla Molden wrote:
>> innerProductList = pool.map(myutil.numpy_inner_product, arrayList)
>
> 1. Here we potentially have a case of false sharing and/or mutex
> contention, as the work is too fine grained. pool.map does not do any
> load balancing. If pool.map is to scale nicely, each work item must
> take a substantial amount of time.

I suspect this is the main issue.

> 2. There is also the question of when the process pool is spawned.
> Though I haven't checked, I suspect it happens prior to calling
> pool.map. But if it does not, this is a factor as well, particularly
> on Windows (less so on Linux and Apple).

It didn't work well on my Mac, so it's either not an issue, or not
Windows-specific, anyway.

> 3. arrayList is serialised by pickling, which has a significant
> overhead. It's not shared memory either, as the OP's code implies,
> but the main thing is the slowness of cPickle.

I'll bet this is a big issue, and one I'm curious about how to
address. I have another problem where I need to multi-process, and I'd
love to know a way to pass data to the other process and back
*without* going through pickle. maybe memmapped files?

>> IPs = N.array(innerProductList)
>
> 4. numpy.array is a very slow function. The benchmark should
> preferably not include this overhead.

I re-ran, moving that out of the timing loop, and, indeed, it helped a
lot, but it still takes longer with the multi-processing. I suspect
that the overhead of pickling, etc. is overwhelming the operation
itself. That, and the load balancing issue that I don't understand!

To test this, I did a little experiment -- creating a "fake" operation,
one that simply returns an element from the input array -- so it
should take next to no time, and we can time the overhead of the
pickling, etc:

$ python shared_mem.py
Using 2 processes
No shared memory, numpy array multiplication took 0.124427080154 seconds
Shared memory, numpy array multiplication took 0.586215019226 seconds
No shared memory, fake array multiplication took 0.000391006469727 seconds
Shared memory, fake array multiplication took 0.54935503006 seconds
No shared memory, my array multiplication took 23.5055780411 seconds
Shared memory, my array multiplication took 13.0932741165 seconds

Bingo!

The overhead of the multi-processing takes about 0.54 seconds, which
explains the slowdown for the numpy method. Not so mysterious after
all.

Bruce Southey wrote:
> But if everything is *single-threaded* and thread-safe, then you just
> create a function and use Anne's very useful handythread.py
> (http://www.scipy.org/Cookbook/Multithreading).

This may be worth a try -- though the GIL could well get in the way.

> By the way, if the arrays are sufficiently small, there is a lot of
> overhead involved such that there is more time in communication than
> computation.

yup -- clearly the case here. I wonder if it's just array size, though
-- won't cPickle time scale with array size? So it may not be size
per se, but rather how much computation you need for a given size
array.

-Chris

[I've enclosed the OP's slightly altered code]

Attachments: myutil.py, shared_mem.py
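For anyone who wants to reproduce the shape of that experiment without
the attachments, a stripped-down sketch (the array count and sizes are
made up; the real inner-product functions live in myutil.py):

import time
import numpy as np
from multiprocessing import Pool

def fake_op(a):
    # stand-in for the user's inner product: nearly free,
    # so what's left is pickling/dispatch overhead
    return a[0, 0]

if __name__ == '__main__':
    array_list = [np.random.rand(10, 10) for i in xrange(10000)]

    pool = Pool(processes=2)

    start = time.time()
    results = pool.map(fake_op, array_list)
    print "pool.map of a near-no-op took", time.time() - start, "seconds"

    start = time.time()
    results = map(fake_op, array_list)
    print "plain map took", time.time() - start, "seconds"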
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
Sturla Molden wrote:
> On Windows this sucks, because there is no fork system call.

Darn -- I do need to support Windows.

> Here we are stuck with multiprocessing and pickle, even if we use
> shared memory.

What do you need to pickle if you're using shared memory?

-Chris
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
Sturla Molden wrote:
> Den 16.06.2011 19:55, skrev Sturla Molden:
>
> The meta-data for the ndarray (strides, shape, dtype, etc.) and the
> name of the shared memory segment. As there is no fork, we must tell
> the other process where to find the shared memory and what to do with
> it.

got it, thanks.

> Which by the way is what the shared memory arrays I and Gaël made
> will do, but we still have the annoying pickle overhead.

Do you have a pointer to that code? It sounds handy.

-Chris
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
Sturla Molden wrote:
> IMHO, trying to beat Intel or AMD performance library developers with
> Python, NumPy and multiprocessing is just silly. Nothing we do with
> array operator * and np.sum is ever going to compare with BLAS
> functions from these libraries.

I think the issue got confused -- the OP was not looking to speed up a
matrix multiply, but rather to speed up a whole bunch of independent
matrix multiplies.

> Sometimes we need a little bit more coarse-grained parallelism. Then
> it's time to think about Python threads and releasing the GIL or use
> OpenMP with C or Fortran. multiprocessing is the last tool to think
> about. It is mostly appropriate for 'embarrassingly parallel'
> paradigms, and certainly not the tool for parallel matrix
> multiplication.

I think this was, in fact, an embarrassingly parallel example. But
when the OP put it on two processors, it was slower than one -- hence
his question. I got the same result on my machine as well.

I'm not sure he tried python threads; that may be worth a shot.

It would also be great if someone that actually understands this stuff
could look at his code and explain why the slowdown occurs (hint,
hint!)

-CHB
Re: [Numpy-discussion] np.histogram: upper range bin
Peter Butterworth wrote: Consistent bin width is important for my applications. With floating point numbers I usually shift my bins by a small offset to ensure values at bin edges always fall in the correct bin. With the current np.histogram behavior you _silently_ get a wrong count in the top bin if a value falls on the upper bin limit. Incidentally this happens by default with integers. ex: x=range(4); np.histogram(x) Again, the trick here is that histogram really is for floating point numbers, not integers. Likely I will test for the following condition when using np.histogram(x): max(x) == top bin limit from the docstring: range : (float, float), optional The lower and upper range of the bins. If not provided, range is simply ``(a.min(), a.max())``. So, in fact, if you don't specify the range, the top of the largest bin will ALWAYS be the max value in your data. You seem to be advocating that that value not be included in any bins -- which would surely not be a good option. With the current np.histogram behavior you _silently_ get a wrong count in the top bin if a value falls on the upper bin limit. It is not a wrong count -- it is absolutely correct. I think the issue here is that you are binning integers, which is really a categorical binning, not a value-based one. If you want categorical binning, there are other ways to do that. I suppose having np.histogram do something different if the input is integers might make sense, but that's probably not worth it. By the way, what would you have it do if you had integers, but a large number, over a large range of values: In [68]: x = numpy.random.randint(0, 1e8, size=(100,) ) In [70]: np.histogram(x) Out[70]: (array([10, 10, 12, 16, 12, 9, 11, 4, 9, 7]), array([ 712131., 10437707.4, 20163283.8, 29888860.2, 39614436.6, 49340013., 59065589.4, 68791165.8, 78516742.2, 88242318.6, 97967895.])) I guess it is better to always specify the correct range, but wouldn't it be preferable if the function provided a warning when this case occurs ?
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
Brandt Belson wrote: Unfortunately I can't flatten the arrays. I'm writing a library where the user supplies an inner product function for two generic objects, and almost always the inner product function does large array multiplications at some point. The library doesn't get to know about the underlying arrays. Now I'm confused -- if the user is providing the inner product implementation, how can you optimize that? Or are you trying to provide said user with an optimized large array multiplication that he/she can use? If so, then I'd post your implementation here, and folks can suggest improvements. If it's regular old element-wise multiplication: a*b (where a and b are numpy arrays) then you are right, numpy isn't using any fancy multi-core aware optimized package, so you should be able to make a faster version. You might try numexpr also -- it's pretty cool, though may not help for a single operation. It might give you some ideas, though. http://www.scipy.org/SciPyPackages/NumExpr -Chris Thanks, Brandt Message: 2 Date: Fri, 10 Jun 2011 09:23:10 -0400 From: Olivier Delalleau sh...@keba.be Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication To: Discussion of Numerical Python numpy-discussion@scipy.org Message-ID: banlktikjppc90ye56t1mr+byaxxaw32...@mail.gmail.com Content-Type: text/plain; charset=iso-8859-1 It may not work for you depending on your specific problem constraints, but if you could flatten the arrays, then it would be a dot, and you could maybe compute multiple such dot products by storing those flattened arrays into a matrix. -=- Olivier 2011/6/10 Brandt Belson bbel...@princeton.edu Hi, Thanks for getting back to me. I'm doing element wise multiplication, basically innerProduct = numpy.sum(array1*array2) where array1 and array2 are, in general, multidimensional. I need to do many of these operations, and I'd like to split up the tasks between the different cores. I'm not using numpy.dot, if I'm not mistaken I don't think that would do what I need. Thanks again, Brandt Message: 1 Date: Thu, 09 Jun 2011 13:11:40 -0700 From: Christopher Barker chris.bar...@noaa.gov Subject: Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication To: Discussion of Numerical Python numpy-discussion@scipy.org Message-ID: 4df128fc.8000...@noaa.gov Content-Type: text/plain; charset=ISO-8859-1; format=flowed Not much time, here, but since you got no replies earlier: I'm parallelizing some code I've written using the built-in multiprocessing module. In my application, I need to multiply many large arrays together -- is that matrix multiplication, or element-wise? If matrix, then numpy should be using LAPACK, which, depending on how it's built, could be using all your cores already. This is heavily dependent on how your numpy (really the LAPACK it uses) is built. and sum the resulting product arrays (inner products). Are you using numpy.dot() for that? If so, then the above applies to that as well. I know I could look at your code to answer these questions, but I thought this might help. -Chris -- Christopher Barker, Ph.D. 
Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman
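(Since numexpr came up: a sketch of what it would look like for exactly this inner product -- numexpr must be installed, and a speedup is not guaranteed for a single operation:)

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)

    ip_numpy = np.sum(a * b)           # allocates a full temporary for a*b
    ip_ne = ne.evaluate('sum(a * b)')  # blocked and multi-threaded, no big temporary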
Re: [Numpy-discussion] Using multiprocessing (shared memory) with numpy array multiplication
Not much time, here, but since you got no replies earlier: I'm parallelizing some code I've written using the built-in multiprocessing module. In my application, I need to multiply many large arrays together -- is that matrix multiplication, or element-wise? If matrix, then numpy should be using LAPACK, which, depending on how it's built, could be using all your cores already. This is heavily dependent on how your numpy (really the LAPACK it uses) is built. and sum the resulting product arrays (inner products). Are you using numpy.dot() for that? If so, then the above applies to that as well. I know I could look at your code to answer these questions, but I thought this might help. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] code review for datetime arange
Mark Wiebe wrote: Because of the nature of datetime and timedelta, arange has to be slightly different than with all the other types. In particular, for datetime the primary signature is np.arange(datetime, datetime, timedelta). I've implemented a simple extension which allows for another way to specify a date range, as np.arange(datetime, timedelta, timedelta). instead of, or in addition to, the above? it seems you can pass in the following types: strings np.datetime64 np.timedelta64 integers (floats ?) Are you essentially doing method overloading to determine what it all means? How do you know if: np.arange('2011', '2020', dtype='M8[Y]') means you want from the years 2011 to 2020 or from 2011 to 4031? np.arange('today', 10, 3, dtype='M8') array(['2011-06-09', '2011-06-12', '2011-06-15', '2011-06-18'], dtype='datetime64[D]') so dtype 'M8' defaults to increments of days? of course, I've lost track of the difference between 'M' and 'M8' (I've never liked the dtype code anyway -- I far prefer np.float64 to 'd', for instance) Will there be a linspace for datetimes? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
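(For reference, the first signature behaves like this in the implementation as it eventually shipped -- a sketch from a released NumPy, not from the pull request itself:)

    import numpy as np

    # np.arange(datetime, datetime, timedelta)
    np.arange(np.datetime64('2011-06-01'), np.datetime64('2011-07-01'),
              np.timedelta64(7, 'D'))
    # -> ['2011-06-01' '2011-06-08' '2011-06-15' '2011-06-22' '2011-06-29']

    # with an explicit unit, strings like '2011' are unambiguous:
    np.arange('2011', '2020', dtype='datetime64[Y]')
    # -> the years 2011 through 2019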
Re: [Numpy-discussion] numpy c api general array handling
Mark Wiebe wrote: I believe what you may want is PyArray_ContiguousFromAny instead of PyArray_ContiguousFromObject. I would also strongly encourage you to take a look at Cython: http://cython.org/ It has built-in support for numpy arrays, so it can take care of a lot of this bookkeeping for you. http://wiki.cython.org/tutorials/numpy -Chris Cheers, Mark On Wed, Jun 8, 2011 at 2:58 AM, Yoshi Rokuko yo...@rokuko.net wrote: hey, i'm writing my first python module in c using numpy/arrayobject.h and i have some problems with different platforms (both linux but really different setup) so i suspect my array handling is not correct. i'm used to c arrays and want to access large numpy arrays from within my c module with no strange array-iter-methods. So what are your remarks to the following: PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, PyArray_INT, 1, 1); ix = (int *)index->data; then using like: for(k = ix[i]; k < ix[i+1]; k++) l = ix[k]; this seems to work pretty well, but real problems come with: PyArg_ParseTuple(args, "O", &arg); adjc = (PyArrayObject *) PyArray_ContiguousFromObject(arg, PyArray_INT, 2, 2); a = (int *)adjc->data; and then using like: aik = a[i + n * k]; it seems like on one system it does not accept 2d numpy arrays just 1d ones or i must hand him a list of 1d numpy arrays like that: A = np.array([[1,0],[0,1]]) B = my.method(list(A)) i would prefer to avoid the list() call in python. what are your remarks on performance would it be faster to do: PyArg_ParseTuple(args, "OO|ii", &arg1, &arg2, &start, &stop); index = (PyArrayObject *) PyArray_ContiguousFromObject(arg1, PyArray_INT, 1, 1); ix = malloc(n * sizeof(int)); for(i = 0; i < n; i++) ix[i] = ((int *)index->data)[i]; and then use ix[] (i call a lot on ix). thank you and best regards, yoshi ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
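(The Python-side equivalent of the advice above, as a sketch -- as_c_int_array is a hypothetical helper name, not a numpy function: make the array C-contiguous with the right dtype before the C module ever sees it, so the C code can index the data directly:)

    import numpy as np

    def as_c_int_array(obj, ndim):
        # accept anything array-like and return a C-contiguous array of
        # C ints -- roughly what PyArray_ContiguousFromAny does in C
        a = np.ascontiguousarray(obj, dtype=np.intc)
        if a.ndim != ndim:
            raise ValueError('expected a %d-d array' % ndim)
        return a

    A = np.array([[1, 0], [0, 1]])
    B = as_c_int_array(A, 2)  # no list(A) workaround needed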
Re: [Numpy-discussion] fixing up datetime
Pierre GM wrote: Using the ISO as reference, you have a good definition of months. Yes, but only one. There are others. For instance, the climate modelers like to use a calendar that has 360 days a year: 12 30-day months. That way they get something with the same timescale as months and years, but have nice, linear, easy to use units (differentiable, and all that). Mark Wiebe wrote: Code / Interpreted as: Y -> 12M, 52W, 365D; M -> 4W, 30D, 720h This is even self-inconsistent: 1Y == 365D 1Y == 12M == 12 * 30D == 360D 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D 1Y == 52W == 52 * 7D == 364D Is it not clear from this what a mess of mis-interpretation might result from all that? This part of the code is used for mapping metadata like [Y/4] -> [3M], or [Y/26] -> [2W]. I agree that this '/' operator in the unit metadata is weird, and wouldn't object to removing it. Weird, dangerous, and unnecessary. I can see how some data may be on, for example, quarters, but that should require a definition of quarters that's more defined. This goes to heck if the data is expressed in something like months since 1995-01-01 Because months are only defined on a Calendar. Here's what the current implementation can do with that one: np.datetime64('1995-01-01', 'M') + 13 numpy.datetime64('1996-02','M') I see -- I have a better idea of the intent here, and I can see that as long as you keep everything in the same unit (say, months, in this case), then this can be a clean and effective way to deal with this sort of data. As I said, the netcdf case is a different use case, but I think the issue there was that the creator of the data was thinking of it as being used like above: months since January, 1995, and the data was all integer values for months, it makes perfect sense, and is well defined. The problem in that case is that the standard does not have a specification that enforces that the units stay months, and that the intervals are integers -- so software looked at that, converted it to, for example, python datetime instances (using some pre-defined definition for the length of a month), and got something that mis-represented the data. The numpy use-case is different, but it's my concern that that same kind of thing could easily happen, because people want to write generic code that deals with arbitrary np.datetime64 instances. I suppose we could consider this analogous to issues with integer and floating point dtypes -- when you convert between those, it's user-beware, but I think that would be more clear if we had a set of dtypes: datetime_months datetime_hours datetime_seconds But that list would get big in a hurry! Also, with the Python datetime module, for instance, what I like about it is that I don't have to know or care how it's stored internally -- all I need to know is what range and precision it can deal with. numpy has performance issues that may not make that possible, but I still like it. Maybe two types: datetime_calendar: for Calendar-type units (months, business days, ...) datetime_continuous: for linear units (seconds, hours, ...) or something like that? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
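(In the datetime64 implementation that finally shipped, this concern was resolved by keeping the calendar units on their own grid: month-unit arithmetic works, but mixing a calendar unit with a linear one is refused rather than silently converted. A small demonstration against a released NumPy:)

    import numpy as np

    # well defined on the month grid:
    print(np.datetime64('1995-01', 'M') + np.timedelta64(13, 'M'))
    # -> 1996-02

    # but a day-resolution datetime plus a month-unit delta is refused:
    try:
        np.datetime64('2011-01-31') + np.timedelta64(1, 'M')
    except TypeError as e:
        print(e)  # Cannot get a common metadata divisor for
                  # NumPy datetime metadata [D] and [M]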
Re: [Numpy-discussion] fixing up datetime
Dave Hirschfeld wrote: Robert Kern robert.kern at gmail.com writes: Well, [D/100] doesn't represent [864s]. It represents something that happens 100 times a day, but not necessarily at precise regular intervals. For example, suppose that I am representing payments that happen twice a month, say on the 1st and 15th of every month, or the 5th and 20th. I would use [M/2] to represent that. It's not [2W], and it's not [15D]. It's twice a month. Got it -- very useful concept, of course. The default conversions may seem to imply that [D/100] is equivalent to [864s], but they are not intended to. They are just a starting point for one to write one's own, more specific conversions. Here is my issue -- I don't think there should be default conversions at all -- as you point out, twice a month is a well defined concept, but it is NOT the same as every 15 days (or any other set interval). I'm not sure it should be possible to convert to a regular interval at all. That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occurring at a monthly frequency, but at a higher frequency. And that frequency is 2/month. In this case the lowest frequency which can capture this data is daily frequency so it should be stored at daily frequency. I don't think it should, because it isn't 1/15 days, or, indeed, a frequency that can be specified as days. Sure you can specify the 5th and 20th of each month in a given time span in terms of days since an epoch, but you've lost some information there. When you want to do math -- like add a certain number of 1/2 months -- when is the 100th payment due? It seems keeping it in M/2 is the most natural way to deal with that -- then you don't need special code to do that addition, only when converting to a string (or other format) datetime. Anyway, my key point is that converting to/from calendar-based units and linear time units is fraught with peril -- it needs to be really clear when that is happening, and the user needs to have a clear and ideally easy way to define how it should happen. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
Mark Wiebe wrote: I'm wondering if removing the business-day unit from datetime64, and adding a business-day API would be a good approach to support all the things that are needed? That sounds like a good idea to me -- and perhaps it could be a general Calendar Functions API, to handle other issues as well. Looking again at the NEP, I see business day as a time unit, then below, that B is interpreted as 24h, 1440m, 86400s, i.e. a day. This seems like a really bad idea -- it implies that you can convert from business day to hours, and back again, but, of course, if you do: start_date + n business_days you should not get the same thing as: start_date + (n * 24) hrs which is why I think that a business day is not really a unit -- it is a specification for a Calendar operation. My thought is that a datetime should not be able to be expressed in units like business days. A timedelta should, but then there needs to be a Calendar (i.e. a set of rules) associated with it, and it should be clearly distinct from linear unit timedeltas. Business days aside, I also see that months and years are given 'Interpreted as' definitions: Y -> 12M, 52W, 365D; M -> 4W, 30D, 720h This is even self-inconsistent: 1Y == 365D 1Y == 12M == 12 * 30D == 360D 1Y == 12M == 12 * 4W == 12 * 4 * 7D == 336D 1Y == 52W == 52 * 7D == 364D Is it not clear from this what a mess of mis-interpretation might result from all that? In thinking more, numpy's needs are a little different than the netcdf standards -- netcdf is a way to transmit and store information -- some of the key problems that come up is that people develop tools to do things with the data that make assumptions. For instance, my tools that work with CF-compliant netcdf pretty much always convert the time axis (expressed as time_units since a_datetime) into python datetime objects, and then go from there. This goes to heck if the data is expressed in something like months since 1995-01-01 Because months are only defined on a Calendar. Anyway, this kind of thing may be less of an issue because we use numpy to write the tools, not to store and share the data, so hopefully, the tool author knows what she/he is working with. But I still think the distinction between linear, convertible time units, and time units that vary depending on where you are on which calendar, and what calendar you are using, should be kept clear. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
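(This is roughly what happened: the business-day unit was dropped as a conversion target and a separate business-day API was added. In released NumPy versions it looks like this:)

    import numpy as np

    np.is_busday('2011-06-11')                   # -> False (a Saturday)
    np.busday_offset('2011-06-10', 3)            # 3 business days after a Friday
    # -> numpy.datetime64('2011-06-15')
    np.busday_count('2011-06-01', '2011-07-01')  # -> 22 weekdays in June 2011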
Re: [Numpy-discussion] np.histogram: upper range bin
Peter Butterworth wrote: in np.histogram the top-most bin edge is inclusive of the upper range limit. As documented in the docstring (see below) this is actually the expected behavior, but this can lead to some weird enough results: In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) Out[72]: (array([1, 1, 2]), array([ 1., 2., 3., 4.])) Is there any way round this or an alternative implementation without this issue ? The way around it is what you've identified -- making sure your bins are right. But I think the current behavior is the way it should be. It keeps folks from inadvertently losing stuff off the end -- the lower end is inclusive, so the upper end should be too. In the middle bins, one has to make an arbitrary cut-off, and put the values on the line somewhere. One thing to keep in mind is that, in general, histogram is designed for floating point numbers, not just integers -- counting integers can be accomplished other ways, if that's what you really want (see np.bincount). But back to your example: In [72]: x=[1, 2, 3, 4]; np.histogram(x, bins=3) Why do you want only 3 bins here? Using 4 gives you what you want. If you want more control, then it seems you really want to know how many of each of the values 1,2,3,4 there are. so you want 4 bins, each *centered* on the integers, so you might do: In [8]: np.histogram(x, bins=4, range=(0.5, 4.5)) Out[8]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) or, if you want to be more explicit: In [14]: np.histogram(x, bins=np.linspace(0.5, 4.5, 5)) Out[14]: (array([1, 1, 1, 1]), array([ 0.5, 1.5, 2.5, 3.5, 4.5])) HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
Charles R Harris wrote: Good support for units and delta times is very useful, but parsing dates and times and handling timezones, daylight savings, leap seconds, business days, etc., is probably best served by addon packages specialized to an area of interest. Just my $.02 I agree here -- I think for numpy, what's key is to focus on the kind of things needed for computational use -- that is, the performance-critical stuff. I suppose business-day type calculations would be both key and performance-critical, but that sure seems like the kind of thing that should go in an add-on package, rather than in numpy. The stdlib datetime package is a little bit too small for my taste (couldn't I at least ask for a TimeDelta to be expressed in, say, seconds, without doing any math on my own?), but the idea is good -- create the core types, let add-on packages do the more specialized stuff. * The existing datetime-related API is probably not useful, and in fact those functions aren't used internally anymore. Is it reasonable to remove the functions, or do we just deprecate them? I say remove, but some polling to see if anyone is using it might be in order first. * Leap seconds probably deserve a rigorous treatment, but having an internal representation with leap-seconds overcomplicates otherwise very simple and fast operations. Could you explain more? I don't get the issues -- leap seconds would come in for calculations like: a_given_datetime + a_timedelta, correct? Given leap years, and all the other ugliness, do leap seconds really make it worse? * Default conversion to string - should it be in UTC or with the local timezone baked in? Most date_time handling should be time-zone neutral -- i.e. assume everything is in the same timezone (and daylight savings status). Libs that assume you want the locale setting do nothing but cause major pain if you have anything out of the ordinary to do (and sometimes ordinary stuff, too). If you MUST include time-zone, explicit is better than implicit -- have the user specify, or, at the very least, make it easy for the user to override any defaults. As UTC it may be confusing because 'now' will print as a different time than people would expect. I think now should be expressed (but also stored) in the local time, unless the user asks for UTC. This is consistent with the std lib datetime.now(), if nothing else. * Should the NaT (not-a-time) value behave like floating-point NaN? i.e. NaT == NaT return false, etc. Should operations generating NaT trigger an 'invalid' floating point exception in ufuncs? Makes sense to me -- at least many folks are used to NaN semantics. And after the removal of datetime from 1.4.1 and now this, I'd be in favor of putting a large experimental sticker over the whole thing until further notice. Do we have a good way to do that? Maybe an experimental warning, analogous to the deprecation warning. Good support for units and delta times is very useful, This part works fairly well now, except for some questions like what should datetime(2011-01-30, D) + timedelta(1, M) produce. Maybe 2011-02-28, or 2011-03-02? Neither -- month should not be a valid unit to express a timedelta in. Nor should year, or anything else that is not clearly defined (we can argue about day, which does change a bit as the earth slows down, yes?) Yes, it's nice to be able to easily have a way of expressing things like every month, or a month from now when you mean a calendar month, but it's a heck of a can of worms. 
We just had a big discussion about this in the netcdf CF metadata standards list: http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2011/007807.html We more or less came to the conclusion (I did, anyway) that there were two distinct, but related concepts: 1) time as a strict unit of measurement, like length, mass, etc. In that case, don't use months as a unit. 2) Calendars -- these are what months, days of the week, etc. come from, and these get ugly. I also learned that there are even more calendars than I thought. Beyond the Julian, Gregorian, etc., there are special ones used for climate modeling and the like, that have nice properties like all months being 30 days long, etc. Plus, as discussed, various business calendars. So: I think that the calendar-related functions need a fairly self-contained library, with various classes for the various calendars one might want to use, and a well-specified way to define new ones. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
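(A small footnote on the stdlib complaint above: as of Python 2.7 a timedelta can be expressed in seconds directly, no math required:)

    from datetime import datetime

    td = datetime(2011, 6, 10) - datetime(2011, 6, 9, 12)
    td.total_seconds()  # -> 43200.0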
Re: [Numpy-discussion] fixing up datetime
Benjamin Root wrote: Just throwing it out there. Back in my Perl days (I do not often admit this), the one module that I thought was really well done was the datetime module. Now, I am not saying that we would need to support the calendar from Lord of the Rings or anything like that, but its floating timezone concept and other features were quite useful. Maybe there are lessons to be learned from there? Another one to look at is the classes in the Earth System Modeling Framework (ESMF). That will give you a good idea what the atmospheric/oceanographic/climate modelers need, and how at least one well-respected group has addressed these issues. http://www.earthsystemmodeling.org/esmf_releases/non_public/ESMF_5_1_0/ESMC_crefdoc/node6.html (If that long URL breaks, I found that by googling: ESMF calendar date time) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
Pierre GM wrote: I also don't think that units like month, year, business day should be allowed -- it just adds confusion. It's not a killer if they are defined in the spec: -1 In scikits.timeseries, not only did we have years, months and business days, but we also had weeks, that proved quite useful sometimes, Actually, I have little issue with weeks -- it can be clearly defined as 7 * 24 hours. The problem with months is that they are not all the same length. One of the issues pointed out in the discussion on the CF list was that for the most part, scientists expect that a time series has nice properties, like differentiability, etc. That all goes to heck if you have months as a unit. Unless a month is clearly defined as a particular number of hours, and ONLY means that, but then some folks may not get what they expect. Anyhow, years and months are simple enough. no, they are not -- they are fundamentally different than hours, days, etc. In case of ambiguities like Mark's example, we can always raise an exception. sure -- but what I'm suggesting is that rather than have one thing (a datetime array) that behaves differently according to what units it happens to use, we clearly separate what I'm calling Calendar functionality from basic time functionality. What that means to me is that a TimeDelta is ALWAYS an unambiguous time unit, and a datetime is ALWAYS expressed as an unambiguous time unit since a well-defined epoch. If you want something like 6 months after a datetime, or 14 business days after a datetime, that should be handled by a calendar class of some sort. (hmm, maybe I'm being too pedantic here -- the Calendar class could be associated with the datetime and timedelta objects, I suppose) so a given timedelta knows how to do math with itself. But I think this is fraught with problems: one of the keys to usability and performance of the datetime arrays is that they are really arrays of integers, and you can do math with them just as though they are integers. That will break if you allow these non-linear time steps. So, thinking out loud here, maybe: datetime timedelta calendartimedelta to make it very clear what one is working with? As for calendardatetime, at first blush, I don't think there is a need. Do you need to express a datetime as n months from the epoch, rather than 12:00am on July 1, 1970 -- which could be represented in hours, minutes, seconds, etc? What I'm getting at is that the difference between calendars is in what timedeltas mean, not what a unit in time is. ISO8601 seems quite OK. All that does is specify a string representation, no? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
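(The third-party dateutil package is one existing example of exactly this timedelta/calendartimedelta split -- its relativedelta is a calendar-aware delta, distinct from the stdlib's linear timedelta. A sketch, assuming dateutil is installed:)

    from datetime import datetime, timedelta
    from dateutil.relativedelta import relativedelta

    d = datetime(2011, 1, 31)
    d + relativedelta(months=1)  # -> datetime(2011, 2, 28): a calendar rule
    d + timedelta(days=30)       # -> datetime(2011, 3, 2):  a linear unit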
Re: [Numpy-discussion] fixing up datetime
Mark Wiebe wrote: It is possible to implement the system so that if you don't use Y/M/B, things work out unambiguously, but if you do use them you get a behavior that's a little weird, but with rules to eliminate the calendar-created ambiguities. yes, but everyone wants different rules -- so it needs to be very clear which rules are in place, and there needs to be a way for a user to specify his/her own rules. For the business day unit, what I'm currently trying to do is get an assessment of whether my proposed design is the right abstraction to support all the use cases of people who want it. Good plan. I rather agree here, adding the 'origin' back in is definitely worth considering. How is the origin represented in the CF netcdf code? As an ISO 8601 string. I can't recall if you have the option of specifying a non-standard calendar. So using the calendar specified by ISO 8601 as the default for the calendar-based functions is undesirable? no -- that's fine -- but does ISO 8601 specify stuff like business day? I think supporting it to a small extent is reasonable, and support for any other calendars or more advanced calendar-based functions would go in support libraries. yup -- what I'm trying to press here is the distinction between linear time units and the weird concepts, like business day, month, etc. I think there are two related, but distinct issues: 1) representation/specification of a datetime. The idea here is this: imagine that there is a continuous property called time (which I suppose has a zero at the Big Bang). We need a way to define where (when) in that continuum a given event, or set of events, occurred. This is what the datetime dtype is about. I think the standard of some-time-unit since some-reference-datetime, in some-calendar is fine, but that the time-unit should be unambiguously and clearly defined, and not change with when it occurs, i.e. seconds, hours, days, but not months, years, or business days. 2) time spans, and math with time: i.e. timedeltas --- this falls into 2 categories: a) simple linear time units: seconds, hours, etc. This is quite straightforward, if working with other time deltas and datetimes all expressed in well-defined linear units. b) calendar manipulations: months since, business days since, once a month, the first Sunday of the month, next Monday. These require a well defined and complex Calendar, and there are many possible such Calendars. What I'm suggesting is that (a) and (b) should be kept quite distinct, and that it should be fairly easy to define and use custom Calendars defined for (b). (a) and (b) could be merged, with various defaults and exceptions raised for poorly defined operations, but I think that'll be less clear, harder to implement, and more prone to error. A little example, again from the CF mailing list (which spawned the discussion). In the CF standard the units available are defined as those supported by the udunits library: http://www.unidata.ucar.edu/software/udunits/ It turns out that udunits only supports time manipulation as I specified in (a), i.e. only clearly defined linear time units. However, they do define months and years, as specific values (something like 365.25 days/year and 12 months/year -- though they also have Julian-year, leap_year, etc) So folks would specify a time axis as: months since 2010-01 and expect that they were getting calendar months, like 1 would mean Feb, 2010, instead of January 31, 2010 (or whatever). 
Anyway, lots of room for confusion, so whatever we come up with needs to be clearly defined. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
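(To make the udunits pitfall concrete -- a sketch using the stdlib and dateutil, taking udunits' year as roughly 365.2422 days: the two readings of "1 month since 2010-01-01" land on different datetimes:)

    from datetime import datetime, timedelta
    from dateutil.relativedelta import relativedelta

    epoch = datetime(2010, 1, 1)

    epoch + relativedelta(months=1)             # calendar month -> 2010-02-01
    epoch + timedelta(days=365.242198781 / 12)  # fixed-length month -> 2010-01-31, mid-morning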
Re: [Numpy-discussion] New functions.
On 5/31/11 6:08 PM, Charles R Harris wrote: 2) Ufunc fadd (nanadd?) Treats nan as zero in addition. so: In [53]: a Out[53]: array([ 1., 2., nan]) In [54]: b Out[54]: array([0, 1, 2]) In [55]: a + b Out[55]: array([ 1., 3., nan]) and nanadd(a,b) would yield: array([ 1., 3., 2.]) I don't see how that is particularly useful, at least not any more useful than nanprod, nandiv, etc, etc... What am I missing? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
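(For what it's worth, the proposed semantics can be sketched in a few lines of plain numpy -- nanadd here is a hypothetical helper, not an existing ufunc:)

    import numpy as np

    def nanadd(a, b):
        # treat nan as zero in addition, per the proposal
        return np.where(np.isnan(a), 0, a) + np.where(np.isnan(b), 0, b)

    a = np.array([1., 2., np.nan])
    b = np.array([0, 1, 2])
    nanadd(a, b)  # -> array([1., 3., 2.])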
[Numpy-discussion] what does in do with numpy arrays?
Hi folks, I've re-titled this thread, as it's about a new question, now: What does: something in a_numpy_array mean? i.e. how has __contains__ been defined? A couple of us have played with it, and can't make sense of it: In [24]: a Out[24]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) In [25]: 3 in a Out[25]: True So the simple case works just like a list. But what if I look for an array in another array? In [26]: b Out[26]: array([3, 6, 4]) In [27]: b in a Out[27]: False OK, so the full b array is not in a, and it doesn't vectorize it, either. But: In [29]: a Out[29]: array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]]) In [30]: b in a Out[30]: True HUH? I'm not sure by what definition we would say that b is contained in a. But maybe... In [41]: b Out[41]: array([ 4, 2, 345]) In [42]: b in a Out[42]: False So is it 'are all of the elements of b in a somewhere'? But only for 2-d arrays? So what does it mean? The docstring is not helpful: In [58]: np.ndarray.__contains__? Type: wrapper_descriptor Base Class: <type 'wrapper_descriptor'> String Form: <slot wrapper '__contains__' of 'numpy.ndarray' objects> Namespace: Interactive Docstring: x.__contains__(y) <==> y in x On 5/29/11 2:50 PM, eat wrote: FWIW, a short prelude on the theme seems quite promising, indeed: In []: A Out[]: array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In []: [0, 1, 2] in A Out[]: True In []: [0, 3, 6] in A Out[]: True In []: [0, 4, 8] in A Out[]: True In []: [8, 4, 0] in A Out[]: True In []: [2, 4, 6] in A Out[]: True In []: [6, 4, 2] in A Out[]: True In []: [3, 1, 5] in A Out[]: True In [1061]: [3, 1, 4] in A Out[1061]: True But In []: [1, 2, 3] in A Out[]: False In []: [3, 2, 1] in A Out[]: True So, obviously the logic behind __contains__ is not so very straightforward. Perhaps just a bug? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
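(A guess at the logic, reverse-engineered from the examples rather than from any documentation: every result above is consistent with 'b in a' meaning (a == b).any(), i.e. broadcast the comparison and collapse it with any():)

    import numpy as np

    A = np.arange(9).reshape(3, 3)

    (A == np.array([3, 2, 1])).any()  # -> True,  matching '[3, 2, 1] in A'
    (A == np.array([1, 2, 3])).any()  # -> False, matching '[1, 2, 3] in A'
    # b is broadcast against the rows, so each element of b is only
    # looked for within its own column -- hence the surprising results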
Re: [Numpy-discussion] Numpy question
On 5/27/11 11:06 AM, Talla wrote: I am using python24 and Numeric under windows XP and I am trying to create *.agr file from *.dbs file, I followed the instructions very carefully and every time I am getting the following message: Traceback (most recent call last): File "AbinitBandStructuremaker.py", line 1274, in ? data.fermienergy = max(maxlist) ValueError: max() arg is an empty sequence Nothing wrong in the .dbs file This is really a question for the author of AbinitBandStructuremaker.py. If there is no one familiar with that code that can help, you'll need to debug it yourself. That error means that in this case, maxlist is an empty list, so trying to find the maximum value in it is not possible. It also looks like that is the built-in python max(), not the Numeric one, so this isn't really a numpy-related question anyway. Without more context, we can't even begin to figure out what's wrong. Good luck, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] finding elements that match any in a set
On 5/27/11 9:48 AM, Michael Katz wrote: I have a numpy array, records, with named fields including a field named integer_field. I have an array (or list) of values of interest, and I want to get the indexes where integer_field has any of those values. Because I can do indexes = np.where( records.integer_field > 5 ) I thought I could do indexes = np.where( records.integer_field in values ) But that doesn't work. (As a side question I'm interested in why that doesn't work, when values is a python list.) That doesn't work because the python list in operator doesn't understand arrays -- so it is looking to see if the entire array is in the list. Actually, it doesn't even get that far: In [16]: a in l --- ValueError Traceback (most recent call last) /Users/chris.barker/<ipython console> in <module>() ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() The ValueError results because it was decided that a numpy array should not have a boolean value to avoid confusion -- i.e. is an array true whenever it is non-empty (like a list), or when all its elements are true, or ...? When I read this question, I thought -- hmmm, numpy needs something like in, as the usual way, np.any(), would require a loop in this case. Then I read Skipper's message: On 5/27/11 9:55 AM, Skipper Seabold wrote: Check out this recent thread. I think the proposed class does what you want. It's more efficient than in1d, if values is small compared to the length of records. So that class may be worthwhile, but I think np.in1d is exactly what you are looking for: indexes = np.in1d( records.integer_field, values ) Funny I'd never noticed that before. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
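(Spelled out with made-up sample data, since in1d returns a boolean mask and the question asked for indexes -- one more step gets those:)

    import numpy as np

    records = np.array([(1, 0.5), (3, 1.5), (5, 2.5), (10, 3.5)],
                       dtype=[('integer_field', int), ('x', float)])
    values = [1, 5, 10]

    mask = np.in1d(records['integer_field'], values)  # boolean mask
    indexes = np.where(mask)[0]                       # -> array([0, 2, 3])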
Re: [Numpy-discussion] Less dimensions than expected with record array
ndarray has 2 dimensions but record array has 1 dimension This makes seemingly reasonable things, like using apply_along_axis() over a table of data with named columns, impossible: each row (record) is treated as one array element, so the structured array is only 1d. If you have rows/records with content that is not homogeneous, then working along axis=1 (across elements of a record) doesn't make sense. for example I just struggle with 2 datetime columns and the rest are integers. If you want an array with homogeneous elements (all floats or all ints) with operations along axis, then larry (la) is, I think, still the best bet. Another option is to use views. There are times when I want the same array visible as both a structured array, and a regular old array, depending on what I'm doing with it. and you can do that: In [77]: data Out[77]: [(1, 2, 3), (4, 5, 6), (7, 8, 9)] In [80]: dt = [('a',int),('b',int),('c',int)] In [81]: record_array = np.array(data, dtype=dt) In [84]: array = record_array.view(dtype=np.int).reshape(-1,3) In [85]: array Out[85]: array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # array and record_array share the same data: In [88]: array[:,1] *= 2 In [89]: array Out[89]: array([[ 1, 4, 3], [ 4, 10, 6], [ 7, 16, 9]]) In [90]: record_array Out[90]: array([(1, 4, 3), (4, 10, 6), (7, 16, 9)], dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4')]) views are one of the really cool things about numpy. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Array of size 'n' with common difference 1
On 4/29/11 12:31 AM, pratik wrote: On Friday 29 April 2011 12:56 PM, dileep kunjaai wrote: Dear sir, I am trying to make an array that varies from -60 to 90 with difference 0.25. I tried the following command ... import numpy as N lat=N.array(xrange(-6000, 9000, 25), dtype=float) print lat/100 xrange() (or range(), or np.arange()) is almost never the right solution for floating point ranges, due to the intricacies of floating point precision. lat = numpy.mgrid[-60:90:.25] or np.linspace: np.linspace(-60,90,((60.+90.)*4. + 1)) ((60.+90.)*4. + 1) is the number of points you want -- the +1 because you want both end points. mgrid is usually used for 2-d (or higher) grids, though it looks like it makes sense for this use, too, though note that it doesn't give you both endpoints in this case. From the docs: If the step length is not a complex number, then the stop is not inclusive. and an example: In [15]: np.mgrid[-1:3:.25] Out[15]: array([-1. , -0.75, -0.5 , -0.25, 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. , 2.25, 2.5 , 2.75]) I think this is too bad, actually, because we're back to range()-type tricks to get the end point: In [20]: np.mgrid[-1:3.25:.25] Out[20]: array([-1. , -0.75, -0.5 , -0.25, 0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. , 2.25, 2.5 , 2.75, 3. ]) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Array of size 'n' with common difference 1
On 4/29/11 1:27 PM, Sebastian Haase wrote: Just for completeness, note this paragraph from the mgrid docs: However, if the step length is a *complex number* (e.g. 5j), then the integer part of its magnitude is interpreted as specifying the number of points to create between the start and stop values, where the stop value *is inclusive*. OK -- for a kluge, I figure you could do complex, then take the real part, but I can't seem to get a complex grid to work at all: In [37]: np.mgrid[(0+0j):(1+0j):(0.1+0j)] Out[37]: array([], dtype=complex128) What am I doing wrong? Got it -- the docs could use some clarification here -- actually, passing in a complex number is a FLAG, indicating that the index means something different, but it's still doing everything with real numbers (floats). Wow, klunky API! but here is what the OP would want: np.mgrid[-60:90:((60.j+90.j)*4. + 1j)] which is klunkier than linspace. I'd have used two different function names for the different mgrid functionality, rather than the complex kludge -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ANN: Numpy 1.6.0 beta 2
Sorry to keep harping on this, but for history's sake, I was one of the folks that got 'U' introduced in the first place. I was dealing with a nightmare of unix, mac and dos text files, 'U' was a godsend. On 4/5/11 4:51 PM, Matthew Brett wrote: The difference between 'rt' and 'U' is (this is for my own benefit): For 'rt', a '\r' does not cause a line break - with 'U' - it does. Perhaps semantics, but what 'U' does is actually change any of the line breaks to '\n' -- any line breaking happens after the fact. In Py2, the difference between 'U' and 't' is that 't' assumes that any file read uses the native line endings -- a bad idea, IMHO. Back in the day, Guido argued that text file line ending conversion was the job of file transfer tools. The reality, however, is that users don't always use file transfer tools correctly, nor even understand the implications of line endings. All that being said, mac-style files are pretty rare these days. (though I bet I've got a few still kicking around) Right - my argument is that the behavior implied by 'U' and 't' is conceptually separable. 'U' is for how to do line-breaks, and line-termination translations, 't' is for whether to decode the text or not. In python 3. But 't' and 'U' are the same in python 3 -- there is no distinction. It seems you are arguing that there could/should be a way to translate line termination without decoding the text, but ... In python 3 a binary file is a file which is not decoded, and returns bytes. It still has a concept of a 'line', as defined by line terminators - you can iterate over one, or do .readlines(). I'll take your word for it that it does, but that's not really a binary file then, it's a file that you are assuming is encoded in an ascii-compatible way. While I know that practicality beats purity, we really should be opening the file as a text file (it is text, after all), and specifying utf-8 or latin-1 or something as the encoding. However, IIUC, then the issue here is later on down the line, numpy uses regular old C code, which expects ascii strings. In that case, we could encode the text as ascii, into a bytes object. That's a lot of overhead for line ending translation, so probably not worth it. But if nothing else, we should be clear in the docs that numpy text file reading code is expecting ascii-compatible data. (and it would be nice to get the line-ending translation) Right - so obviously if you open a utf-16 file as binary, terrible things may happen - this was what Pauli was pointing out before. His point was that utf-8 is the standard, but it's not the standard -- it's a common use, but not a standard -- ideally numpy wouldn't enforce any particular encoding (though it could default to one, and utf-8 would be a good choice for that) Once you really make the distinction between text and binary, the concept of a line terminator doesn't really make sense anyway. Well - I was arguing that, given we can iterate over lines in binary files, then there must be the concept of what a line is, in a binary file, and that means that we need the concept of a line terminator. maybe, but that concept is built on an assumption that your file is ascii-compatible (for \n anyway), and you know what they say about assumptions... I realize this is a discussion that would have to happen on the python-dev list... I'm not sure -- I was thinking that python missed something here, but I don't think so anymore. 
In the unicode world, there is no choice but to be explicit about encodings, and if you do that, then python's text or binary distinction makes sense. .readline() for a binary file doesn't, but so be it. Honestly, I've never been sure in this discussion what code actually needs fixing, so I'm done now -- we've talked enough that the issues MUST have been covered by now! -Chris ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ANN: Numpy 1.6.0 beta 2
On 4/5/11 10:33 PM, Matthew Brett wrote: Did you mean to send this just to me? It seems like the whole is generally interesting and helpful, at least to me... I did mean to send to the list -- I've done that now. Well, the current code doesn't split on \r in py3k, admittedly that must be a rare case. I guess that's a key question here -- It really *should* split on \r, but maybe it's rare enough to be unimportant. The general point about whether to read binary or text is also in play. I agree with you, reading text sounds like a better idea to me, but I don't know the performance hit. Pauli had it as reading binary in the original conversion and was defending that in an earlier email... The thing is -- we're talking about parsing text here, so we really are reading text files, NOT binary files. So the question really is -- do we want py3's file reading code to handle encoding issues for us, or do we want to handle them ourselves. If we only want to support ansi encodings, then handling it ourselves may well be easier and faster performing. If we go that way we need to handle line-endings, too. The easy way is to only support line endings with a '\n' in them -- that works out of the box. But it's not that hard to support '\r' also, depending on how you want to do it. Before 'U' existed, I did that all the time, something like: some_text = file.read(some_size_buffer) some_text = some_text.replace('\r\n', '\n') some_text = some_text.replace('\r', '\n') lines = some_text.split('\n') (By the way, if you're going to support this, it's really nice to support mixed line-endings, like this approach does -- there are a lot of editors that can make a mess of line endings.) If you can read the entire file into memory at once, this is almost trivial; if you can't, there is a bit more bookkeeping code to be written. DARN -- I think I said my last note was the last on this topic! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
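(One more trick along the same lines: str.splitlines() already understands '\n', '\r\n' and '\r', so if the whole file fits in memory the bookkeeping disappears. A sketch -- the helper name and the ascii assumption are made up for the example:)

    def universal_lines(path):
        # read raw bytes, decode (assuming 1-byte, ascii-compatible text),
        # and let splitlines() handle any mix of \n, \r\n and \r
        with open(path, 'rb') as f:
            text = f.read().decode('ascii')
        return text.splitlines()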
Re: [Numpy-discussion] Rounding the decimal part of a real number
On 4/6/11 6:24 AM, Alan G Isaac wrote: http://docs.scipy.org/doc/numpy/reference/generated/numpy.round_.html simple enough, of course, but just to be clear: In [108]: np.round(1.23456789, 3) Out[108]: 1.2350000000000001 so the number is rounded to the requested number of decimal places, but then stored in a binary floating point format, which may not be able to exactly represent that rounded number -- hence the 1 at the end. This is simply how floating point works. and that slight difference _probably_ doesn't matter, but it's worth being aware of, because it does make a difference occasionally. python has a decimal type that can work with exact decimal numbers -- numpy doesn't support that, as there is no hardware support for it (at least on most platforms). If you want to display it differently, you can use the string formatters: In [110]: "%.3f" % np.round(1.23456789, 3) Out[110]: '1.235' HTH, Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ANN: Numpy 1.6.0 beta 2
On 4/4/11 10:35 PM, Charles R Harris wrote: IIUC, Ub is undefined -- U means universal newlines, which makes no sense when used with b for binary. I looked at the code a ways back, and I can't remember the resolution order, but there isn't any checking for incompatible flags. I'd expect that genfromtxt, being txt, and line oriented, should use 'rU'. But if it wants the raw line endings (why would it?) then rb should be fine. U has been kept around for backwards compatibility, the python documentation recommends that it not be used for new code. That is for 3.* -- the 2.7.* docs say: In addition to the standard fopen() values mode may be 'U' or 'rU'. Python is usually built with universal newline support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newline support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen. Python enforces that the mode, after stripping 'U', begins with 'r', 'w' or 'a'. which does, in fact, indicate that 'Ub' is NOT allowed. We should be using 'Ur', I think. Maybe that 'Python enforces' clause is where the error we saw came from -- it didn't used to enforce anything. On 4/5/11 7:12 AM, Charles R Harris wrote: The 'Ub' mode doesn't work for '\r' on python 3. This may be a bug in python, as it works just fine on python 2.7. Ub never made any sense anywhere -- U means universal newline text file. b means binary -- combining them makes no sense. On older pythons, the behaviour of 'Ub' was undefined -- now, it looks like it is supposed to raise an error. Does 'Ur' work with \r line endings on Python 3? According to my read of the docs, 'U' does nothing -- universal newline support is supposed to be the default: On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. It may indeed be desirable to read the files as text, but that would require more work on both loadtxt and genfromtxt. Why can't we just open the file with mode 'Ur'? Text is text, messing with line endings shouldn't hurt anything, and it might help. If we stick with binary, then it comes down to: - Will having an extra \r with Windows files hurt anything? -- probably not. - Are there many mac-style text files out there anymore? Not many. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] convert array to tuples?
Hi folks, What's the fastest way to convert a numpy array to tuples? Unfortunately, not all packages take arbitrary sequences when they are expecting a list of tuples. In this case, I'm using PIL's ImageDraw, and want to do: draw.line(points) (points is an Nx2 numpy array of ints) but if I pass in an array, I get, oddly enough, nothing. No error, just nothing drawn. If I do: draw.line(points.tolist()) SystemError: new style getargs format but argument is not a tuple So I seem to need tuples. This works: draw.line([tuple(p) for p in points.tolist()]) but that seems like it would be slower than it should be (to be honest, not profiled). Is there a faster way? Maybe numpy should have an ndarray.totuple() method. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] convert array to tuples?
On 4/5/11 10:52 AM, Sebastian Haase wrote: How about fixing PIL... I know that there is not a good track record of getting patches into PIL, but did you get to the bottom of it and find how draw.line is implemented? no. I haven't. And yes, adapting PIL to use the buffer interface for this kind of thing would be great. I also have wanted to do that for wxPython, where it would be welcome, but still haven't gotten around to it. BTW, is it drawing anti-aliased lines ? Not with the ImageDraw module, but there is a new AggDraw module that should be prettier. On 4/5/11 11:49 AM, Mark S Niemczyk wrote: Just a suggestion (I am very new to Numpy), but wouldn't draw.line(tuple(points.tolist())) accomplish what you want. nope, tolist() creates a list of lists -- tuple() only converts the outer list to a tuple, which is why I did the list comprehension. On 4/5/11 10:53 AM, josef.p...@gmail.com wrote: does a structured dtype work? Good idea -- I'll give it a try -- indeed it does: dt = np.dtype([('x', np.int32),('y',np.int32)]) draw.line(points.view(dtype=dt).reshape((-1,)).tolist(), fill="rgb(0,255,0)", width=4) but it's not any faster, alas. The fastest I've come up with is putting the elements in a single 1-d list. PIL can take either a sequence of tuples: [(x,y),(x,y),...] or the flattened version: [x,y,x,y,x,y,x,y,...] So this is the fastest: draw.line(points.flatten().tolist(), fill="rgb(0,255,0)", width=4) Thanks all, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
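[Editor's sketch collecting the three conversion options from the thread, self-contained; the sample points array is hypothetical:

import numpy as np

points = np.arange(2000, dtype=np.int32).reshape(-1, 2)  # an Nx2 array of ints

as_tuples = [tuple(p) for p in points.tolist()]       # list of (x, y) tuples
flat = points.flatten().tolist()                      # [x, y, x, y, ...] -- fastest per the thread
dt = np.dtype([('x', np.int32), ('y', np.int32)])
structured = points.view(dt).reshape(-1).tolist()     # also yields a list of tuples
]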
Re: [Numpy-discussion] ANN: Numpy 1.6.0 beta 2
On 4/5/11 3:36 PM, josef.p...@gmail.com wrote: I disagree that U makes no sense for binary file reading. I wasn't saying that it made no sense to have a U mode for binary file reading, what I meant is that by the python2 definition, it made no sense. In Python 2, the ONLY difference between binary and text mode is line-feed translation. As for Python 3: In python 3: 'b' means return byte objects 't' means return decoded strings 'U' means two things: 1) When iterating by line, split lines at any of '\r', '\r\n', '\n' 2) When returning lines split this way, convert '\r' and '\r\n' to '\n' a) 'U' is default -- it's essentially the same as 't' (in PY3), so 't' means return decoded and line-feed translated unicode objects b) I think the line-feed conversion is done regardless of whether you are iterating by lines, i.e. with a full-on .read(). At least that's how it works in py2 -- not running py3 here to test. If you support returning lines from a binary file (which python 3 does), then I think 'U' is a sensible thing to allow -- as in this case. But what is a binary file? I THINK what you are proposing is that we'd want to be able to have both linefeed translation and no decoding done. But I think that's impossible -- aren't the linefeeds themselves encoded differently with different encodings? U looks appropriate in this case, better than the workarounds. However, to me the python 3.2 docs seem to say that U only works for text mode Agreed -- but I don't see the problem -- your files are either encoded in something that might treat newlines differently (UCS32, maybe?), in which case you'd want it decoded, or you are working with ascii or ansi or utf-8, in which case you can specify the encoding anyway. I don't understand why we'd want a binary blob for text parsing -- the parsing code is going to have to know something about the encoding to work -- it might as well get passed in to the file open call, and work with unicode. I suppose if we still want to assume ascii for parsing, then we could use 't' and then re-encode to ascii to work with it. Which I agree does seem heavy handed just for fixing newlines. Also, one problem I've often had with encodings is what happens if I think I have ascii, but really have a couple characters above 127 -- then the default is to get an error in decoding. I'd like to be able to pass in a flag that either skips the un-decodable characters or replaces them with something, but it doesn't look like you can do that with the file open function in py3. The line terminator is always b'\n' for binary files; Once you really make the distinction between text and binary, the concept of a line terminator doesn't really make sense anyway. In the ansi world, everyone should always have used 'U' for text. It probably would have been the default if it had been there from the beginning. People got away without it because: 1) dos line feeds have a \n in them anyway 2) most of the time it doesn't matter that there is an extra whitespace character in there 3) darn few of us ever had to deal with the mac \r Now that we are in a unicode world (at least a little) there is simply no way around the fact that you can't reliably read a file without knowing how it is encoded. My thought at this point is to say that the numpy text file reading stuff only works on 1byte, ansi encoding (and maybe only ASCII), and be done with it.
utf-8 might be OK -- I don't know if there are any valid files in, say, latin-1, that utf-8 will barf on -- you may not get the non-ascii symbols right, but that's better than barfing. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
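[Editor's note: Python 3's open() -- and io.open() on Python 2.6+ -- does in fact take an errors argument that does what the post above asks for. A minimal sketch, file name hypothetical:

import io
# replace undecodable bytes with U+FFFD instead of raising:
text = io.open('data.txt', encoding='ascii', errors='replace').read()
# or silently drop them:
text = io.open('data.txt', encoding='ascii', errors='ignore').read()
]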
Re: [Numpy-discussion] Old tickets
On 3/31/11 8:41 AM, Benjamin Root wrote: The ticket is about the functions np.divide and np.power, not / and **. This currently does not work the same, unlike what's said above: Hmm, I didn't notice that distinction. So long as there is a consistent difference between the function and the operator, I am fine with that. whether we want np.power to behave differently from ** (if they are internally handled separately at all)... Please, no! The operator and functions should behave the same. I certainly always expect them to. For instance, I often write code with operators, then replace that with the functions so that I can add an out parameter for performance reasons -- I would be very surprised if the function were defined differently -- it would be a source of surprising bugs. So, in this case, the ** operator and np.power are identical, but diverge from the behavior of python's operator. I think consistency within numpy is more important than consistency with python -- I expect differences like this from python (long integers, different data types, etc) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
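[Editor's sketch of the operator-to-function refactor described above:

import numpy as np

a = np.arange(5.0)
b = a ** 2               # operator form
np.power(a, 2, out=a)    # same operation, writing into a to avoid a temporary
]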
Re: [Numpy-discussion] Willing to contribute to SciPy NumPy ...
On 3/31/11 12:04 PM, Anthony Scopatz wrote: A good place to get started is helping out with documentation (see http://docs.scipy.org/numpy/Front%20Page/). Actually, according to Jacob Kaplan-Moss's talk at PyCon this year (http://pycon.blip.tv/file/4881071/) Nice talk -- thanks for the link. (with which I am inclined to agree), documentation is probably not the best way to get started. I agree as well. However, I think newbies can be a great help in identifying holes in the documentation -- so when documentation is missing or unclear, ask here, and then you'll have something to contribute. -Chris Do check out the developer zone, maybe search the open tickets and see if there are any bugs you want to tackle. Maybe write some tests for untested code you find. Good ideas as well, of course. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Array views
On 3/26/11 10:32 AM, srean wrote: I am also interested in this. In my application there is a large 2d array, lets call it 'b' to keep the notation consistent in the thread. b's columns need to be recomputed often. Ideally this re-computation happens in a function. Lets call that function updater(b, col_index): The simplest example is where updater(b, col_index) is a matrix vector multiply, where the matrix or the vector changes. Is there anyway apart from using ufuncs that I can make updater() write the result directly in b and not create a new temporary column that is then copied into b ? Say for the matrix vector multiply example. Probably not -- the trick is that when an array is a view of a slice of another array, it may not be laid out in memory in a way that other libs (like LAPACK, BLAS, etc) require, so the data needs to be copied to call those routines. To understand all this, you'll need to study up a bit on how numpy arrays lay out and access the memory that they use: they use a concept of strided memory. It's very powerful and flexible, but most other numeric libs can't use those same data structures. I'm not sure what a good doc is to read to learn about this -- I learned it from messing with the C API. Take a look at any docs that talk about strides, and maybe playing with the stride tricks tools will help. A simple example: In [3]: a = np.ones((3,4)) In [4]: a Out[4]: array([[ 1., 1., 1., 1.], [ 1., 1., 1., 1.], [ 1., 1., 1., 1.]]) In [5]: a.flags Out[5]: C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False So a is a (3,4) array, stored in C_contiguous fashion, just like a regular old C array. A lib expecting data in this fashion could use the data pointer just like regular C code. In [6]: a.strides Out[6]: (32, 8) this means it is 32 bytes from the start of one row to the next, and 8 bytes from the start of one element to the next -- which makes sense for a 64bit double. In [7]: b = a[:,1] In [10]: b Out[10]: array([ 1., 1., 1.]) so b is a 1-d array with three elements. In [8]: b.flags Out[8]: C_CONTIGUOUS : False F_CONTIGUOUS : False OWNDATA : False WRITEABLE : True ALIGNED : True UPDATEIFCOPY : False but it is NOT C_Contiguous -- the data is laid out differently than a standard C array. In [9]: b.strides Out[9]: (32,) so this means that it is 32 bytes from one element to the next -- for an 8-byte data type. This is because the elements are each one element in a row of the a array -- they are not all next to each other. A regular C library generally won't be able to work with data laid out like this. HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
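[Editor's sketch of the stride-tricks tools mentioned above; as_strided does no bounds checking, so treat this as a toy:

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.ones((3, 4))
col = as_strided(a, shape=(3,), strides=(a.strides[0],))  # a view of column 0
print col.strides   # (32,) -- the same non-contiguous layout as the b above
]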
Re: [Numpy-discussion] Array views
On 3/26/11 10:12 AM, Pauli Virtanen wrote: On Sat, 26 Mar 2011 13:10:42 -0400, Hugo Gagnon wrote: [clip] a1 = b[:,0] a2 = b[:,1] ... and it works but that doesn't help me for my problem. Is there a way to reformulate the first code snippet above but with shallow copying? No. You need a 2-D array to own the data. The second way is the approach to use if you want to share the data. Exactly -- but to clarify, it's not just about ownership, it's about layout of the data in memory. The data in a numpy array needs to be laid out in memory as one block, with consistent strides from one element to the next, one row to the next, etc. When you create an array from scratch (like your 2-d array here), you get one big block of memory. If you create each row separately, they each have their own block of memory that is unrelated -- there is no way to put those together into one block with consistent strides. So you need to create that big block first (the 2-d array), then you can reference parts of it for each row. See my previous note for a bit more discussion. Oh, and maybe the little presentation and sample code I gave to the Seattle Python Interest group will help: http://www.seapig.org/November2010Notes -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
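[Editor's demo of the share-the-block approach, names from the thread:

import numpy as np

b = np.zeros((4, 3))
a1 = b[:, 0]     # a view into b's block -- no copy
a1[:] = 7.0
print b[0, 0]    # 7.0 -- writing through the view changed b
]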
[Numpy-discussion] Scientific Software developer wanted in Seattle.
Scientific Software Developer NOAA Emergency Response Division Help us develop our next-generation oil spill transport model. Background: The Emergency Response Division (ERD) of NOAA's Office of Response and Restoration (ORR) provides scientific expertise to support the response to oil and chemical spills in the coastal environment. We played a major role in the recent Deepwater Horizon oil spill in the Gulf of Mexico. In order to fulfill our mission, we develop many of the software tools and models required to support a response to hazardous material spills. In the wake of the Deepwater Horizon incident, we are embarking on a program to develop our next-generation oil spill transport model, taking into account lessons learned from years of response and this major incident. General Characteristics: The incumbent of this position will provide software development services to support the mission of the Emergency Response Division of NOAA's Office of Response and Restoration. As part of his/her efforts, independent evaluation and application of development techniques, algorithms, software architecture, and programming patterns will be required. The incumbent will work with the staff of ERD to provide analysis on user needs and software, GUI, and library design. He/she will be expected to work primarily on site at NOAA's facility in Seattle. Knowledge: The incumbent must be able to apply modern concepts of software engineering and design to the development of computational code, desktop applications, web applications, and libraries. The incumbent will need to be able to design, write, refactor, and implement code for a complex desktop and/or web application and computational library. The incumbent will work with a multi-disciplinary team including scientists, users, and other developers, utilizing software development practices such as usability design, version control, bug and issue tracking, and unit testing. Good communication skills and the knowledge of working as part of a team are required. Direction received: The incumbent will participate on various research and development teams. While endpoints will be identified through Division management and some direct supervision will be provided, the incumbent will be responsible for progressively being able to take input from team meetings and design objectives and propose strategies for reaching endpoints. Typical duties and responsibilities: The incumbent will work with the oil and chemical spill modeling team to improve and develop new tools and models used in fate and transport forecasting. Different components of the project will be written in C++, Python, and Javascript. Education requirement, minimum: Bachelor's degree in a technical discipline. Experience requirement, minimum: One to five years experience in development of complex software systems in one or more full-featured programming languages (C, C++, Java, Python, Ruby, Fortran, etc.) The team requires experience in the following languages/disciplines.
Each incumbent will need experience in some subset:
* Computational/Scientific programming
* Numerical Analysis/Methods
* Parallel processing
* Desktop GUI
* Web services
* Web clients: HTML/CSS/Javascript
* Python
* wxPython
* OpenGL
* C/C++
* Python--C/C++ integration
* Software development team leadership
While the incumbent will work on-site at NOAA, directly with the NOAA team, this is a contract position with General Dynamics Information Technology: http://www.gdit.com/default.aspx For more information and to apply, use the GDIT web site: https://secure.resumeware.net/gdns_rw/gdns_web/job_detail.cfm?key=59436&show_cart=0&referredId=20 If that long url doesn't work, try: http://www.resumeware.net/gdns_rw/gdns_web/job_search.cfm and search for job ID: 179178 NOTE: This is a position being hired by GDIT to work with NOAA, so any questions about salary, benefits, etc, etc should go to GDIT. However, feel free to send me questions about our organization, working conditions, more detail about the nature of the projects etc. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fast robus implementation of float()
On 3/22/11 1:54 PM, Christian K. wrote: I wonder if someone has a good solution for a fast conversion of gridded ascii data to ndarray. The fastest out-of-the-box way is with np.fromfile(input_file, sep=' ', dtype=np.float) It will only read multiple lines if the separator is whitespace, but it's pretty fast if it does. It should manage ',' as decimal point (on demand) I think that will work if the locale is set right, though I don't know how to do that on demand and special windows numbers as 1.#INF. I can't remember if it does that -- give it a try. It does use an ascii-to-float function written for numpy to handle things like that. Of course, this is easy to wrap in a small function but I expect it to be slow when the input size is in the Mb range. Never expect -- do a simple solution, then see if it's too slow for your needs. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] avoid a line...
On 3/17/11 12:46 AM, eat wrote: I am trying to read a file of the following format. I want to avoid the first line and read the remaining as 'float'. Please help me...

Rain Wind Temp Pr  Sal
0.1  1.1  0.02 0.2 0.2
0.5  0.   0.   0.4 0.8
0.5  5.5  1.5  0.5 1.5

It's not as robust, but if performance matters, fromfile() should be faster: f.readline() # to skip the first line arr = np.fromfile(f, sep=' ', dtype=np.float64).reshape((-1, 5)) (untested) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
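[Editor's note: if robustness matters more than speed, loadtxt can skip the header line directly; file name hypothetical:

import numpy as np
arr = np.loadtxt('data.txt', skiprows=1)   # shape (3, 5) for the sample above
]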
Re: [Numpy-discussion] hashing dtypes, new variation, old theme
On 3/17/11 2:57 PM, Mark Wiebe wrote: Dtypes being mutable looks like a serious bug to me, it's violating the definition of 'hashable' given here: I can imagine other problems it would cause, as well -- is there any reason that dtypes should be mutable? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
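[Editor's sketch of the mutability in question; whether the hash actually changes may depend on the numpy version:

import numpy as np

dt = np.dtype([('a', np.int32)])
h_before = hash(dt)
dt.names = ('b',)              # the dtype object is mutated in place
print dt                       # dtype([('b', '<i4')])
print hash(dt) == h_before     # if False, any dict/set holding dt is now broken
]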
Re: [Numpy-discussion] Nonzero behaving strangely?
On 3/17/11 3:17 PM, santhu kumar wrote: nid = (res[:,4]==2).nonzero() nid tuple turns out to be empty. But the very first row satisfies the criteria. It works for me: In [45]: arr Out[45]: array([[ 33.35053669, 49.4615004 , 44.27631299, 1., 2. ], [ 32.84263059, 50.24752036, 43.92291659, 1., 0. ], [ 33.68999668, 48.90554673, 43.51746687, 1., 0. ], [ 34.11564931, 49.77487763, 44.83843076, 1., 0. ], [ 32.4641859 , 48.65469145, 45.09300791, 1., 3. ], [ 32.15428526, 49.26922262, 45.92959026, 1., 0. ], [ 31.23860825, 48.21824628, 44.30816331, 1., 0. ], [ 30.71171138, 47.45600573, 44.9282456 , 1., 0. ], [ 30.53843426, 49.07713258, 44.20899822, 1., 0. ], [ 31.54722284, 47.61953925, 42.95235178, 1., 0. ], [ 32.44334635, 48.10500653, 42.51103537, 1., 0. ], [ 31.77269609, 46.53603145, 43.06468455, 1., 0. ], [ 30.1820843 , 47.80819604, 41.77667819, 1., 0. ], [ 30.78652668, 46.82907769, 40.38586451, 1., 0. ], [ 30.05963091, 46.84268609, 39.54583693, 1., 0. ], [ 31.75239177, 47.22768463, 40.00717713, 1., 0. ], [ 30.94617127, 45.76986265, 40.68226643, 1., 0. ], [ 33.20069679, 47.42127403, 45.66738249, 1., 0. ], [ 34.39608116, 47.25481126, 45.4438599 , 1., 0. ]]) In [46]: arr.shape Out[46]: (19, 5) In [47]: nid = (arr[:,4]==2).nonzero() In [48]: nid Out[48]: (array([0]),) Maybe you are having a floating point comparison problem -- i.e. that 2.0 is really 1.99 or something. Checking equality with floating point numbers is fraught with problems. Also, you may not need the nonzero: In [56]: arr[:,4]==2 Out[56]: array([ True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False], dtype=bool) that's a boolean array that can be used for indexing, operating with where, etc. HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
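[Editor's note: if the 2 really is 1.999..., a tolerance test is the safe spelling, e.g.:

nid = (np.abs(arr[:, 4] - 2.0) < 1e-9).nonzero()
]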
Re: [Numpy-discussion] *= operator not intuitive
On 3/16/11 9:22 AM, Paul Anton Letnes wrote: This comes up for discussion on a fairly regular basis. I tend towards the more warnings side myself, but you aren't going to get the current behavior changed unless you can convince a large bunch of people that it is the right thing to do, which won't be easy. Indeed, I don't think I'd want a warning for this -- though honestly, I'm not sure what you think should invoke a warning: a = np.ones((3,), dtype=np.float32) a += 2.0 Should that be a warning? You are adding a 64 bit float to a 32bit float array. a = np.ones((3,), dtype=np.float32) a += 2 now you are adding an integer -- should that be a warning? a = np.ones((3,), dtype=np.int32) a += 2.1 now a float to an integer array -- should that be a warning? As you can see -- there are a lot of options. As they are defined as in-place operators, I don't expect any casting to occur. I think this is the kind of thing that would be great to have a warning the first time you do it, but once you understand, the warnings would be really, really annoying! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
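[Editor's demo of the silent cast in the integer case -- behavior of the numpy of this era; later versions grew stricter casting rules:

import numpy as np

a = np.ones(3, dtype=np.int32)
a += 2.1
print a    # [3 3 3] -- the .1 is silently truncated, no warning
]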
Re: [Numpy-discussion] Fortran was dead ... [was Re: rewriting NumPy code in C or C++ or similar]
On 3/15/11 8:33 AM, Charles R Harris wrote: There really isn't a satisfactory array library for C++. The fact that every couple of years there is another project to produce one testifies to that fact. And I think not just the fact that there is not one, but that perhaps C++ the language, or maybe the culture, simply doesn't support that way of thinking. I've been slowly arriving at the conclusion that there is no place for C++ in programming. If you really need to twiddle bits, use C. If you need high performance numerics, use Fortran. If you need high level complex data structures, use Python. And the fact that using all these in the same program is pretty easy makes it all possible. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ragged array implimentation
Francesc Alted wrote: Just a C pointer to a malloc'ed area (it cannot be an ndarray, as the data area might be compressed). Got it. OK. Thanks. The code is really simple, and that's great. However, carray is quite a bit more sophisticated, as it supports not only enlarging arrays, but also shrinking, which is handy. And since 0.4 on, it also supports n-dimensional arrays (although you can only resize along the first dimension). I've been meaning to add that, too -- not much to it really, but haven't had a need yet, so haven't gotten around to it. I hope it was a bit helpful, and I'm looking forward to seeing what's next for carray! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ragged array implimentation
On 3/7/11 5:51 PM, Sturla Molden wrote: Den 07.03.2011 18:28, skrev Christopher Barker: 1, 2, 3, 4 5, 6 7, 8, 9, 10, 11, 12 13, 14, 15 ... A ragged array, as implemented in C++, Java or C# is just an array of arrays (or 'a pointer to an array of pointers'). Sure, but as a rule I don't find direct translation of C++ or Java code to Python the best approach ;-) Basically, that is an ndarray of ndarrays (or a list of ndarrays, whatever you prefer). ra = np.zeros(4, dtype=np.ndarray) ra[0] = np.array([1,2,3,4]) ra[1] = np.array([5,6]) ra[2] = np.array([7,8,9,10,11,12]) ra[3] = np.array([13,14,15]) ra array([[1 2 3 4], [5 6], [ 7 8 9 10 11 12], [13 14 15]], dtype=object) ra[1][1] 6 ra[2][:] array([ 7, 8, 9, 10, 11, 12]) yup -- or I could use a list to store the rows, which would add the ability to append rows. Slicing in two dimensions does not work as some might expect: ra[:2][:2] array([[1 2 3 4], [5 6]], dtype=object) yup -- might want to overload indexing to do something smarter about that, though in my use-case, slicing vertically isn't really useful anyway -- the nth element in one row doesn't necessarily have anything to do with the nth element in another row. However, aside from the slicing syntax issue, what I lose with the approach is the ability to get reasonable performance on operations on the entire array: ra *= 3.3 I'd like that to be numpy-efficient. What I need to grapple with is: 1) Is there a point to trying to build a general purpose ragged array? Or should I just build something that satisfies my use-case at hand? 2) What's the balance I need between performance and flexibility? Putting the rows in a list gives a lot more flexibility; putting it all in one 1-d numpy array could give better performance. NOTE: this looks like it could use a growable numpy array, much like one I've written before -- maybe it's time to revive that project and use it here, fixing some performance issues while I'm at it. Thanks for all your ideas, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ragged array implimentation
On 3/10/11 9:51 AM, Francesc Alted wrote: A Thursday 10 March 2011 18:05:11 Christopher Barker escrigué: NOTE: this looks like it could use a growable numpy array, much like one I've written before -- maybe it's time to revive that project and use it here, fixing some performance issues while I'm at it. A growable array would be indeed very useful in many situations. My carray project [1] is another attempt to create such an object. It grows (or shrinks) arrays by adding (removing) chunks, so they are fast beasts for performing these operations. I suppose this is more or less like your approach. I don't think so -- my approach is a lot simpler than I think carray is: it acts much like a python list: 1) it pre-allocates extra space when an array is created. 2) when you append, it uses that extra space 3) when the extra space is used up, it re-allocates the entire array, with some more extra room And to keep it really simple, I'm using a numpy array internally, and np.ndarray.resize to grow it. It can only be grown on the first axis (C-order). IIUC, carray is doing a lot more smarts about keeping the array in chunks that can be individually manipulated, which is very cool. Not to mention the compression! I also thought about implementing ragged arrays in carray, like those you are suggesting, but not sure when they will be ready for public consumption. That would be very cool, and I think you are right, the facilities in carray could make a really nice ragged array. Perhaps it is a good time to join efforts and create a nice 'growable' array package? Sure, but I don't know that I have much to offer, except maybe some test and timing code. By the way, it would be great to have a growable array that could be efficiently used in Cython code -- so you could accumulate a bunch of native C datatype data easily. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
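[Editor's sketch of the growth strategy described above; the class and its names are hypothetical, and real code would also want shrinking, slicing, etc.:

import numpy as np

class GrowableArray(object):
    """1-d array that appends like a list: pre-allocate, then double."""
    def __init__(self, dtype=np.float64):
        self._data = np.empty(16, dtype=dtype)   # extra space up front
        self._len = 0
    def append(self, value):
        if self._len == self._data.shape[0]:
            # out of room: re-allocate in place with headroom -- amortized O(1)
            self._data.resize(2 * self._data.shape[0], refcheck=False)
        self._data[self._len] = value
        self._len += 1
    def to_array(self):
        return self._data[:self._len].copy()

ga = GrowableArray()
for x in range(1000):
    ga.append(x)
print ga.to_array().shape    # (1000,)
]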
Re: [Numpy-discussion] ragged array implimentation
On 3/10/11 11:29 AM, Christopher Barker wrote: By the way, it would be great to have a growable array that could be efficiently used in Cython code -- so you could accumulate a bunch of native C datatype data easily. I just noticed that carray is written in Cython, so that part should be easy! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Numpy 1.6 schedule
On 3/6/11 5:54 AM, Charles R Harris wrote: I suppose this might cause a problem with lazy/quick c extensions that expected elements in a certain order, so some breakage could occur. Absolutely! (I've gotten a bit confused about this thread, but if this is about the question of whether structured dtypes are assumed to always keep field order, read on -- otherwise, ignore.) I have always assumed a structured dtype was much like a C struct: i.e. arrangement in memory is fixed, and the named fields are just for convenience. I was very surprised to learn that they could ever be used in a more dict-like manner. Also, I noted a recent post to this list suggesting that folks sometimes misuse structured arrays (particularly folks coming from MATLAB, etc), by using them when what they really want is a dict. In that post, the poster suggested that the primary use of structured arrays was for interacting with data files and C code that has arrays of structures, and indeed, that is how I have primarily used them. So I think for performance and consistency with C code, keeping dtype order is the best way to go. If that breaks backward compatibility too much, then it shouldn't be changed, but I think it better reflects what dtypes are all about. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
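[Editor's sketch of the C-struct correspondence: field offsets follow declaration order, with no padding by default:

import numpy as np

dt = np.dtype([('x', np.float64), ('y', np.float64), ('id', np.int32)])
print dt.itemsize                                  # 20 == 8 + 8 + 4, packed
print [dt.fields[name][1] for name in dt.names]    # offsets: [0, 8, 16]
]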
[Numpy-discussion] ragged array implimentation
Hi folks, I'm setting out to write some code to access and work with ragged arrays stored in netcdf files. It dawned on me that ragged arrays are not all that uncommon, so I'm wondering if any of you have any code you've developed that I could learn from, borrow from, etc. Note that when I say a ragged array, I mean a set of data where each row could be a different arbitrary length:

1, 2, 3, 4
5, 6
7, 8, 9, 10, 11, 12
13, 14, 15
...

In my case, these will only be 2-d, though I suppose one could have an n-d version where the last dimension was ragged (or any dimension, I suppose, though I'm having trouble wrapping my brain around what that would look like...) I'm not getting more specific about what I think the API should look like -- that is part of what I'm looking for: suggestions, previous implementations, etc. Is there any standard way to work with such data? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ragged array implimentation
On 3/7/11 9:33 AM, Francesc Alted wrote: A Monday 07 March 2011 18:28:11 Christopher Barker escrigué: I'm setting out to write some code to access and work with ragged arrays stored in netcdf files. It dawned on me that ragged arrays are not all that uncommon, so I'm wondering if any of you have any code you've developed that I could learn from, borrow from, etc. A list of numpy arrays would not be enough? Or you want something more specific? Maybe that would, but in mapping to the netcdf data model, I'm thinking more like a big 1-d numpy array, with a way to index into it. Also, that would allow you to do some math with the arrays, if the broadcasting made sense, anyway. But now that you've entered the conversation, does HDF and/or pytables have a standard way of dealing with this? On 3/7/11 9:37 AM, Jeff Whitaker wrote: Chris: The netcdf4-python module reads netcdf vlen arrays Are those a netcdf4 feature? So far, I'm still working with netcdf3 -- though this could be a compelling reason to move on! We've talked about this some on the CF list, and I don't think anyone brought that up. and returns numpy object arrays, where the elements of the object arrays are themselves 1d numpy arrays. I don't think there is any other way to do it. In your example, the 'ragged' array would be a 1d numpy array with dtype='O', and the individual elements would be 1d numpy arrays with dtype=int. Of course, these arrays are very awkward to deal with and operations will be slow. Depends on the operations, but yes. That still may be the best option, but I'm exploring others. Is a vlen array stored contiguously in netcdf? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
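[Editor's sketch of the "big 1-d array plus an index" idea, using the sample rows from the original post; names are hypothetical:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                dtype=np.float64)
offsets = np.array([0, 4, 6, 12, 15])   # row i is data[offsets[i]:offsets[i+1]]

def row(i):
    return data[offsets[i]:offsets[i + 1]]   # a view, not a copy

print row(1)     # [ 5.  6.]
data *= 3.3      # whole-array math stays numpy-fast
]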
Re: [Numpy-discussion] ragged array implimentation
On 3/7/11 11:18 AM, Francesc Alted wrote: Well, I don't think there is such a 'standard' way for dealing with ragged arrays, but yes, pytables has support for them. Creating them is easy: # Create a VLArray: fileh = tables.openFile('vlarray1.h5', mode='w') vlarray = fileh.createVLArray(fileh.root, 'vlarray1', tables.Int32Atom(shape=()), "ragged array of ints", filters=tables.Filters(1)) # Append some (variable length) rows: vlarray.append(array([5, 6])) vlarray.append(array([5, 6, 7])) vlarray.append([5, 6, 9, 8]) Then, you can access the rows in a variety of ways, like iterators: print '--', vlarray.title for x in vlarray: print '%s[%d]-- %s' % (vlarray.name, vlarray.nrow, x) -- ragged array of ints vlarray1[0]-- [5 6] vlarray1[1]-- [5 6 7] vlarray1[2]-- [5 6 9 8] or via __getitem__, using general fancy indexing: a_row = vlarray[2] a_list = vlarray[::2] a_list2 = vlarray[[0,2]] # get list of coords a_list3 = vlarray[[0,-2]] # negative values accepted a_list4 = vlarray[numpy.array([True,...,False])] # array of bools but, instead of returning a numpy array of 'object' elements, plain python lists are returned instead. which gives you the append option -- I can see how that would be useful. Though I'd kind of like to have numpy ufunc/broadcasting performance, i.e. have vlarray * some_constant be fast, and work out of the box! More info on VLArray object in: http://www.pytables.org/docs/manual/ch04.html#VLArrayClassDescr is a vlen array stored contiguously in netcdf? Great, thanks! That gives me an example of one API I might want to use. I don't really know, but one limitation of variable length arrays in HDF5 (and hence NetCDF4) is that they cannot be compressed (but that should be addressed in the future). good to know. Thanks to both you and Jeff, This has given me some things to ponder. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] rewriting NumPy code in C or C++ or similar
On 3/7/11 3:36 PM, Dan Halbert wrote: We currently have some straightforward NumPy code that indirectly implements a C API defined by a third party. We built a Cython layer that directly provides the API in a .a library, and then calls Python. The layering looks like this: C main program -> API in Cython -> Python -> NumPy This is difficult to package for distribution, because of the Python and NumPy dependencies. I'd say learn py2exe, py2app and friends, and be done with it. Otherwise, I wonder if you could re-factor your Cython code enough to remove all python dependencies -- or at least have something you could compile and run without a python interpreter running. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] How to get the prices of Moving Averages Crosses?
) # dates_ma20_greater_than_ma50 = (dates[ma20_greater_than_ma50]) # prices_ma20_greater_than_ma50 = (prices[ma20_greater_than_ma50]) print dates_cross print prices_cross #print dates_ma20_greater_than_ma50 #print prices_ma20_greater_than_ma50 plot.plot(prices) plot.plot(ma20) plot.plot(ma50) plot.show() Can someone give me some clues? Best Regards, -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] using loadtxt() for given number of rows?
On 2/14/11 8:28 AM, Stéfan van der Walt wrote: You can always read chunks of the file into StringIO objects, and pass those into loadtxt. True, but there isn't any single method for loading n lines of a file into a StringIO object, either. I've always thought that file.readlines() should take a number of rows as an optional parameter. Not a big deal, now that we have list comprehensions, but still it would be nice, and it makes sense to put it into loadtxt() for sure. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
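[Editor's sketch of the chunk-reading idea, Python 2 spelling; itertools.islice pulls exactly n lines, and the file name is hypothetical:

import itertools
import numpy as np
from StringIO import StringIO

f = open('data.txt')
chunk = StringIO(''.join(itertools.islice(f, 100)))   # next 100 lines
arr = np.loadtxt(chunk)
]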
Re: [Numpy-discussion] conditional array indexing
On 2/14/11 2:39 PM, Bryan Woods wrote: Thanks for your reply. Unfortunately it is not working that simply. This tells me that only integer arrays with one element can be converted to an index. Examples, examples, examples! I think this is what you want: In [15]: land_cover Out[15]: array([[4, 2, 0, 4], [0, 2, 1, 1], [1, 1, 4, 2]]) In [16]: z0_legend Out[16]: array([ 3.4, 5.2, 1.3, 4.2, 6.4]) In [17]: roughness = z0_legend[land_cover] In [18]: roughness Out[18]: array([[ 6.4, 1.3, 3.4, 6.4], [ 3.4, 1.3, 5.2, 5.2], [ 5.2, 5.2, 6.4, 1.3]]) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] create a numpy array of images
It seems that using 64 bit python is the solution. It's certainly the easy way to access a lot of memory -- and memory is cheap these days. But the thing is I would compile my code and want to distribute it to the clients.. I don't think 64 bit gets in the way of that -- except that it will only run on 64 bit systems, which may be an issue. only reason why I want to work on a 32 bit system. Sturla, how can I make sure that some part of the data is kept on the disk and only the necessary one in the memory; as this seems to be a solution to my problem. You can roll your own and have a disk cache of some sort -- it would be pretty easy to store each image in a *.npz file and load them up as you need them. But it would probably be even easier to use one of the hdf-based libraries, such as pytables -- I think it will do it all for you. One other option, that I've never tried, is carray, which is an array compressed in memory. Depending on your images, perhaps they would compress a lot (or not): https://github.com/FrancescAlted/carray http://mail.scipy.org/pipermail/numpy-discussion/2010-August/052378.html As I said I want a 3d visualization out of the numpy array. It works fine for the downsampled dataset. And to visualize, I have to convert the 16bit data into 8bit as PIL doesn't support 16 bit data. It's unclear to me what your native data really is: 16 bit greyscale? 8 bit greyscale? either one should fit OK into 32 bit memory, and if 8bit is accurate enough for your needs, then it should be pretty easy. stack = numpy.empty((120, 1024, 1024)) numpy defaults to double precision float, np.float64, i.e. 8 bytes per element -- you probably don't want that if you are concerned about memory use, and have 8 or 16 bit greyscale images. Try: stack = np.empty((120, 1024, 1024), dtype=np.uint8) # (or dtype=np.uint16) i = 0 os.chdir(dirr) for f in os.listdir(dirr): im = Image.open(f) im = im.convert('L') you might want to try mode 'I'. That should give you a 32 bit integer grey scale, which should hold all the 16 bit data without loss -- then you can convert to 16 bit when you bring it into numpy. a = numpy.asarray(im) print a.dtype what does this print? it should be np.uint8 stack[i] = a here, if a is a uint8, numpy will convert it to a float64 to fit into stack -- that's why you want to set the dtype of stack when you create it. one more thing, it really doesn't work for tiff files at all, I have to convert them into jpgs as a prior step to this. probably not the best choice either, jpeg is lossy -- it will smear things out a bit, which you may not want. It also only holds 24 bit RGB (I think), which is both a waste and will lose information from 16 bit greyscale. If you have to convert, try PNG, though I'm not sure if it handles 16 bit greyscale, either. I'd look at a lib that can read tiff properly -- some have been suggested here, and you can also use GDAL, which is meant for geo-referenced data, but you can ignore the geo information and just get an image if you want. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
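[Editor's sketch of the mode-'I' route above, ending with 16-bit data on the numpy side; the file name is hypothetical and 16-bit tiff support varies by PIL version:

import Image   # classic PIL spelling; newer Pillow uses: from PIL import Image
import numpy as np

im = Image.open('slice.tif').convert('I')   # 32-bit integer greyscale
a = np.asarray(im).astype(np.uint16)        # safe as long as values fit in 16 bits
]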
Re: [Numpy-discussion] create a numpy array of images
On 2/1/11 12:39 AM, Asmi Shah wrote: I have one more question: how to avoid the limitation of memoryerror in numpy. As I have like 200 images to stack in the numpy array of say 1024x1344 resolution.. have any idea apart from downsampling? If I'm doing my math right, that's 262 MB, shouldn't be a problem on modern systems. That's 8bit, but 786MB if 24 bit RGB. If you are careful about how many copies you're keeping around (including temporaries), you may be OK still. But if you really have big collections of images, you might try memory mapped arrays -- as Sturla pointed out they won't let you create monster arrays on a 32 bit python, but maybe they do help with not clogging up memory too much? I don't know -- I haven't used them -- presumably they have a purpose. Also, pytables is worth a look, as another way to get HDF5 on disk, but with, I think, more natural access. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
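[Editor's note: the memory-mapped option is one line; the file name is hypothetical, and the data lives on disk and pages in as touched:

import numpy as np
stack = np.memmap('stack.dat', dtype=np.uint8, mode='w+', shape=(200, 1024, 1344))
]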
Re: [Numpy-discussion] create a numpy array of images
On 2/1/11 8:31 AM, Friedrich Romstedt wrote: In case you *have* to downsample: I also ran into this, with the example about my 5 images ... im.resize((newx, newy), PIL.Image.ANTIALIAS) will be your friend. http://www.pythonware.com/library/pil/handbook/image.htm. If you want to downsample by an integer amount (i.e. a factor of 2) in each dimension, I have some Cython code that optimizes that. I'm happy to send it along. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
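[Editor's sketch: for integer factors a pure-numpy block mean also works (not the Cython code mentioned above, just the same idea); factor-of-2 case shown:

import numpy as np

a = np.arange(16.0).reshape(4, 4)
small = a.reshape(2, 2, 2, 2).mean(axis=3).mean(axis=1)  # (2, 2) block means
]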
Re: [Numpy-discussion] using loadtxt() for given number of rows?
On 1/31/11 4:39 AM, Robert Cimrman wrote: I work with text files which contain several arrays separated by a few lines of other information, for example: POINTS 4 float -5.00e-01 -5.00e-01 0.00e+00 5.00e-01 -5.00e-01 0.00e+00 5.00e-01 5.00e-01 0.00e+00 -5.00e-01 5.00e-01 0.00e+00 CELLS 2 8 3 0 1 2 3 2 3 0 I have used custom Python code with loops to read similar files, so the speed was not too good. Now I wonder if it would be possible to use the numpy.loadtxt() function for the array-like parts. It supports passing an open file object in, which is good, but it wants to read the entire file, which does not work in this case. It seems to me, that an additional parameter to loadtxt(), say nrows or numrows, would do the job, I agree that that would be a useful feature. However, I'm not sure it would help performance much -- I think loadtxt is written in python as well. One option in the meantime. If you know how many rows, you presumably know how many items on each row. In that case, you can use: np.fromfile(open_file, sep=' ', count=num_items_to_read) It'll only work for multi-line text if the separator is whitespace, which it was in your example. But if it does, it should be pretty fast. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] create a numpy array of images
On 1/28/11 7:01 AM, Asmi Shah wrote: I have been using python for a while now and I have a requirement of creating a numpy array of microscopic tiff images ( this data is 3d, meaning there are 100 z slices of 512 X 512 pixels.) How can I create an array of images? It's quite straightforward to create a 3-d array to hold this kind of data: image_block = np.empty((100, 512, 512), dtype=??) now you can load it up by using some lib (PIL, or ???) to load the tif images, and then: for i, im in enumerate(images): image_block[i,:,:] = im note that I put dtype to ??? up there. What dtype you want is dependent on what's in the tiff images -- tiff can hold just about anything. So if they are, say, 16 bit greyscale, you'd want: dtype=np.uint16 if they are 24 bit rgb, you might want a custom dtype (I don't think there is a 24 bit dtype built in): RGB_type = np.dtype([('r',np.uint8),('g',np.uint8),('b',np.uint8)]) for 32 bit rgba, you can use the same approach, or just a 32 bit integer. The cool thing is that you can make views of this array with different dtypes, depending on what's easiest for the given use case. You can even break out the rgb parts into a separate axis: image_block = np.empty((100, 512, 512), dtype=RGB_type) image_block_rgb=image_block.view(dtype=np.uint8).reshape((100,512,512,3)) The two arrays now share the same data block, but you can look at them differently. I think this a really cool feature of numpy. i then would like to use visvis for visualizing this in 3D. you'll have to see what visvis is expecting in terms of data types, etc. HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy.append numpy.where vs list.append and brute iterative for loop
On 1/27/11 1:53 PM, Sturla Molden wrote: But N appends are O(N) for lists and O(N*N) for arrays. hmmm - that doesn't seem quite right -- lists still have to re-allocate and copy, they just do it every n times (where n grows with the list), so I wouldn't expect exactly O(N). But you never know 'till you profile. See the enclosed code and figures. Interestingly both appear to be pretty linear, though the constant is much larger for numpy arrays. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov append_time.py Description: application/python attachment: append_timing.png___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Developer NumPy list versus User NumPy list
On 1/27/11 3:35 PM, Travis Oliphant wrote: What is the thought about having two separate NumPy lists (one for development discussions and one for user discussions)? Speaking as someone who hasn't contributed code to numpy itself, I still really like to follow the development discussion, so I'll subscribe to both lists anyway, and filter them to the same place in my email. So it makes little difference to me. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy.append numpy.where vs list.append and brute iterative for loop
On 1/27/11 3:54 PM, Sturla Molden wrote: Lists allocate empty slots at their back, proportional to their size. So as lists grow, re-allocations become rarer and rarer. Then on average the complexity per append becomes O(1), which is the amortised complexity. Appending N items to a list thus has the amortized complexity O(N). I think I get that now... NumPy arrays are designed to be fixed size, and not designed to amortize the complexity of appends. So if you want to use arrays as efficient re-sizeable containers, you must code this logic yourself. And I do get that. And yet, experimentally, appending numpy arrays (on that one simple example) appeared to be O(N). Granted, a much larger constant than for lists, but it sure looks linear to me. Should it be O(N^2)? Maybe I need to run it for larger N, but I got impatient as it is. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
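[Editor's sketch of a harness to push N higher; np.append copies the whole array on every call, so the quadratic term should eventually dominate:

import numpy as np
import timeit

def list_append(n):
    a = []
    for i in range(n):
        a.append(i)

def array_append(n):
    a = np.zeros(0)
    for i in range(n):
        a = np.append(a, i)   # allocates and copies all i elements, every time

for n in (1000, 4000, 16000):
    print n, timeit.timeit(lambda: list_append(n), number=5), \
        timeit.timeit(lambda: array_append(n), number=5)
]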
Re: [Numpy-discussion] Installing on CentOS 5 claims invalid Python installation
On 12/23/10 5:13 AM, Roger Martin wrote: NumPy looks like the way to get computation done in Python. yup -- welcome! Now I'm going through the learning curve of installing the module into different linux OS's and Python versions. hmm -- usually it's pretty straightforward on Linux (except maybe getting an optimized LAPACK, which you may or may not need). An extra need is to install google code's h5py http://code.google.com/p/h5py/ which depends on numpy. I'll leave that for the next step. In trying a number of Python versions the 2.x's are yielding the message invalid Python installation --- raise DistutilsPlatformError(my_msg) distutils.errors.DistutilsPlatformError: invalid Python installation: unable to open /home/roger/Python-2.6.6/dist/lib/python2.6/config/Makefile (No such file or directory) --- From reading on the web it appears a Python-2.x.x-devel version is needed. yup -- many of the Linux package systems split the stuff you need to run Python code from what you need to compile stuff against it -- common with other libs, packages as well. Yet no search combination comes back with where to get such a thing each distro has its own naming convention -- look for anything like python-devel, python-dev, etc. (note: I need user installs/builds for security reasons). Ahh -- a different story -- AFAIK (and I'm NOT an expert) the distro's packages will install python into system directories -- if you really need each user to have their own install, you may need to install from source. That should be pretty straightforward, too. Get the source tarball from python.org, and follow the build instructions. You'll need to specify a user install in that process somehow. The latest numpy should work with any recent python -- If you are free to choose, use 2.7.1 -- it's the latest production version of the 2.* series. 3.* is still a bit bleeding edge. You can grab the tarball here: http://www.python.org/download/releases/2.7.1/ Once you've got a python working, a simple "python setup.py install" should do for numpy. HTH, -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
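[Editor's note: the from-source user install usually boils down to something like the following; the prefix path is just an example:

./configure --prefix=$HOME/local
make
make install
~/local/bin/python setup.py install    # run in the numpy source directory
]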
Re: [Numpy-discussion] Can I add rows and columns to recarray?
On 12/5/10 7:56 PM, Wai Yip Tung wrote: I'm fairly new to numpy and I'm trying to figure out the right way to do things. Continuing on my question about using recarray as a relation.

note that recarrays (or structured arrays -- AFAIK, the difference is attribute access only; I don't use recarrays) are far more static than a database table. So you may really want to use a database, or maybe pytables. Or maybe even just stick with lists. But if you are keeping things in memory, you should be able to do what you want:

In [339]: arr = np.array([ (1, 2.2, 0.0),
                           (3, 4.5, 0.0) ],
                         dtype=[ ('unit', int),
                                 ('price', float),
                                 ('amount', float),
                                 ] )

In [340]: data = arr.view(np.recarray)

One of the most common things I want to do is to append rows to data.

numpy arrays do not naturally support appending, as you have discovered.

I think concatenate() might be the method.

yes. But I get a problem:

In [342]: np.concatenate((data0, [1, 9.0, 9.0]))
---------------------------------------------------------------------------
TypeError                         Traceback (most recent call last)
<ipython console> in <module>()
TypeError: expected a readable buffer object

concatenate expects arrays to be joined. If you pass in something that can easily be turned into an array it will work, but a tuple can be converted to multiple types of arrays, so it doesn't know what to do. So you need to construct the second array yourself, with the matching dtype (dt here is the dtype used above):

a2 = np.array( [(3, 5.5, 3)], dtype=dt )
arr = np.concatenate( (arr, a2) )

In [343]: data.amount = data.unit * data.price

yup.

But sometimes it may require me to add a new column not already exist, e.g.:

In [344]: data.discount_price = data.price * 0.9

How can I add a new column?

you can't. What you need to do is create a new array with a new dtype that includes the new field. The trick is that numpy only supports homogeneous arrays -- every item is the same data type. So when you create a struct array like the one above, numpy does not define it as a 2-d table, but rather a 1-d array, each element of which is a structure. So you need to do something like:

# create a new array
data2 = np.zeros(len(data), dtype=dt2)
# fill the array:
for field_name in dt.fields.keys():
    data2[field_name] = data[field_name]
# now some calculations:
data2['discount_price'] = data2['price'] * 0.9

I don't know of a way to avoid that loop when filling the array. Better yet -- anticipate your needs and create the array with all the fields you need in the first place.

You can see that ndarrays are pretty static -- struct arrays can be useful data storage, but are not very suitable when things are changing much. You could write a class that wraps an ndarray and supports what you need better -- it could be a pretty useful general-purpose class, too. I've got one that handles the appending part, but nothing for adding new fields. Here's appending with my class:

data3 = accumulator.accumulator(dtype=dt2)
data3.append((1, 2.2, 0.0, 0.0))
data3.append((3, 4.5, 0.0, 0.0))
data3.append((2, 1.2, 0.0, 0.0))
data3.append((5, 4.2, 0.0, 0.0))
print repr(data3)
# convert to a regular array for calculations:
data3 = np.array(data3)
# now some calculations:
data3['discount_price'] = data3['price'] * 0.9

You wouldn't have to convert to a regular array, except that I haven't written the code to support field access yet -- I don't think it would be too hard, though. A consolidated, runnable sketch of the append-and-add-a-field steps follows this message.

I've enclosed some test code, and my accumulator class, in case you find it useful.

-Chris -- Christopher Barker, Ph.D.
Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov [attachments: struct_test.py, accumulator.py] ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
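The consolidated sketch mentioned above -- a minimal, self-contained version of the append and add-a-field steps (the field names match the example; dt2 is simply dt plus the new field):

import numpy as np

dt = np.dtype([('unit', int), ('price', float), ('amount', float)])
arr = np.array([(1, 2.2, 0.0), (3, 4.5, 0.0)], dtype=dt)

# appending a row: build a one-element array with the same dtype,
# then concatenate -- a bare tuple or list is ambiguous
a2 = np.array([(3, 5.5, 3.0)], dtype=dt)
arr = np.concatenate((arr, a2))

# adding a field: build a new dtype and copy field by field
dt2 = np.dtype(dt.descr + [('discount_price', float)])
data2 = np.zeros(len(arr), dtype=dt2)
for name in dt.names:
    data2[name] = arr[name]
data2['discount_price'] = data2['price'] * 0.9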
Re: [Numpy-discussion] Can I add rows and columns to recarray?
On 12/6/10 11:00 AM, Benjamin Root wrote: numpy.lib.recfunctions has a method for easily adding new columns. cool! There is a lot of other nifty-looking stuff in there too. The OP should really take a look. And maybe an appending function is in order, too. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
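If I'm reading the docs right, adding a column with recfunctions goes something like this (a sketch, not tested here):

import numpy as np
from numpy.lib import recfunctions

data = np.array([(1, 2.2), (3, 4.5)],
                dtype=[('unit', int), ('price', float)])

# append_fields builds a new array with the extra column;
# usemask=False returns a plain structured array rather than a masked one
data2 = recfunctions.append_fields(data, 'discount_price',
                                   data['price'] * 0.9,
                                   usemask=False)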