Re: [Numpy-discussion] Fastest way to compute summary statistics for a specific axis
Sebastian Berg sebastian at sipsolutions.net writes:

On Mo, 2015-03-16 at 15:53 +0000, Dave Hirschfeld wrote: I have a number of large arrays for which I want to compute the mean and standard deviation over a particular axis - e.g. I want to compute the statistics for axis=1 as if the other axes were combined so that in the example below I get two values back

In [1]: a = randn(30, 2, 1)

For the mean this can be done easily like:

In [2]: a.mean(0).mean(-1)
Out[2]: array([ 0.0007, -0.0009])

If you have numpy 1.7+ (which I guess by now is basically always the case), you can do a.mean((0, 1)). Though it isn't actually faster in this example, probably because it has to use buffered iterators and things, but I would guess the performance should be much more stable depending on memory order, etc. than any other method.

- Sebastian

Wow, I didn't know you could even do that - that's very cool (and a lot cleaner than manually reordering and reshaping). It seems to be pretty fast for me and reasonably stable wrt memory order:

In [199]: %timeit a.mean(0).mean(-1)
     ...: %timeit a.mean(axis=(0,2))
     ...: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1)
     ...: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0)
     ...:
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.5 ms per loop
100 loops, best of 3: 4.8 ms per loop
100 loops, best of 3: 14.6 ms per loop

In [200]: a = a.copy('F')

In [201]: %timeit a.mean(0).mean(-1)
     ...: %timeit a.mean(axis=(0,2))
     ...: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1)
     ...: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0)
100 loops, best of 3: 2.02 ms per loop
100 loops, best of 3: 3.29 ms per loop
100 loops, best of 3: 7.18 ms per loop
100 loops, best of 3: 15.9 ms per loop

Thanks,
Dave
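A minimal sketch of the tuple-axis approach applied to the original std question (assuming numpy >= 1.7; the array shape is just for illustration):

import numpy as np

a = np.random.randn(30, 2, 10000)   # hypothetical shape for illustration

# statistics for axis=1 with the other axes combined, no transpose/reshape needed
m = a.mean(axis=(0, 2))
s = a.std(axis=(0, 2))
assert m.shape == (2,) and s.shape == (2,)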
[Numpy-discussion] Fastest way to compute summary statistics for a specific axis
I have a number of large arrays for which I want to compute the mean and standard deviation over a particular axis - e.g. I want to compute the statistics for axis=1 as if the other axes were combined so that in the example below I get two values back In [1]: a = randn(30, 2, 1) For the mean this can be done easily like: In [2]: a.mean(0).mean(-1) Out[2]: array([ 0.0007, -0.0009]) ...but this won't work for the std. Using some transformations we can come up with something which will work for either: In [3]: a.transpose(2,0,1).reshape(-1, 2).mean(axis=0) Out[3]: array([ 0.0007, -0.0009]) In [4]: a.transpose(1,0,2).reshape(2, -1).mean(axis=-1) Out[4]: array([ 0.0007, -0.0009]) If we look at the performance of these equivalent methods: In [5]: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0) 100 loops, best of 3: 14.5 ms per loop In [6]: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1) 100 loops, best of 3: 5.05 ms per loop we can see that the latter version is a clear winner. Investigating further, both methods appear to copy the data so the performance is likely down to better cache utilisation. In [7]: np.may_share_memory(a, a.transpose(2,0,1).reshape(-1, 2)) Out[7]: False In [8]: np.may_share_memory(a, a.transpose(1,0,2).reshape(2, -1)) Out[8]: False Both methods are however significantly slower than the initial attempt: In [9]: %timeit a.mean(0).mean(-1) 1000 loops, best of 3: 1.2 ms per loop Perhaps because it allocates a smaller temporary? For those who like a challenge: is there a faster way to achieve what I'm after? Cheers, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)
Daniel Smith dgasmith at icloud.com writes:

Hello everyone, I originally brought an optimized einsum routine forward a few weeks back that attempts to contract numpy arrays together in an optimal way. This can greatly reduce the scaling and overall cost of the einsum expression for the cost of a few intermediate arrays. The current version (and more details) can be found here: https://github.com/dgasmith/opt_einsum I think this routine is close to a finalized version, but there are two problems which I would like the community to weigh in on: Thank you for your time, -Daniel Smith

I wasn't aware of this work, but it's very interesting to me as a user of einsum whose principal reason for doing so is speed. Even though I use it on largish arrays I'm only concerned with the performance as I'm on x64 with plenty of memory even were it to require temporaries 5x the original size.

I don't use einsum that much because I've noticed the performance can be very problem dependent so I've always profiled it to check. Hopefully this work will make the performance more consistent, allowing it to be used more generally throughout my code.

Thanks,
Dave

* An anecdotal example from a user only.
Re: [Numpy-discussion] help using np.einsum for stacked matrix multiplication
Andrew Nelson writes:

Dear list, I have a 4D array, A, that has the shape (NX, NY, 2, 2). I wish to perform matrix multiplication of the 'NY' 2x2 matrices, resulting in the matrix B. B would have shape (NX, 2, 2). I believe that np.einsum would be up to the task, but I'm not quite sure of the subscripts I would need to achieve this. Can anyone help, please? cheers, Andrew.

Sorry, I'm a little unclear on what is supposed to be multiplied with what. You've got (NX x NY) 2x2 matrices, how do you end up with NX 2x2 matrices? Perhaps if you show the code using loops with `np.dot` it will be clearer how to translate to using `np.einsum`

-Dave
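For reference, two plausible readings of the question, sketched only (the sizes and the second operand are made up for illustration):

import numpy as np

NX, NY = 4, 5                           # hypothetical sizes
A = np.random.randn(NX, NY, 2, 2)
B = np.random.randn(NX, NY, 2, 2)

# Reading 1: multiply corresponding 2x2 matrices from two stacks elementwise
C = np.einsum('xyij,xyjk->xyik', A, B)  # shape (NX, NY, 2, 2)

# Reading 2: reduce over NY by multiplying the 2x2 matrices in sequence
# (a chained product like this can't be written as a single einsum call)
D = np.empty((NX, 2, 2))
for x in range(NX):
    acc = np.eye(2)
    for y in range(NY):
        acc = np.dot(acc, A[x, y])
    D[x] = acc                          # shape (NX, 2, 2)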
Re: [Numpy-discussion] segfault in np.arange
Julian Taylor jtaylor.debian at googlemail.com writes: On 23.10.2014 19:21, Dave Hirschfeld wrote: Hi, I accidentally passed a pandas DatetimeIndex to `np.arange` which caused it to segfault. It's a pretty dumb thing to do but I don't think it should cause a segfault! thanks for the report, this patch should fix it: https://github.com/numpy/numpy/pull/5225 Thanks for the super-fast patch! Hopefully it will make it into the next bugfix release, even if it is an apparently little used functionality. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] segfault in np.arange
Hi,
I accidentally passed a pandas DatetimeIndex to `np.arange` which caused it to segfault. It's a pretty dumb thing to do but I don't think it should cause a segfault!

Python 2.7.5 |Continuum Analytics, Inc.| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import faulthandler
>>> faulthandler.enable()
>>> import numpy as np
>>> np.__version__
'1.9.0'
>>> import pandas as pd
>>> pd.__version__
'0.14.1'
>>> np.arange(pd.DatetimeIndex([]))
Fatal Python error: Segmentation fault

Current thread 0x1d18:
  File "<stdin>", line 1 in <module>

The exception dialog which pops up contains the following information:

Unhandled exception at 0x0255EB49 (multiarray.pyd) in python.exe: 0xC0000005: Access violation reading location 0x00000008.

Thanks,
Dave
[Numpy-discussion] Website down!
It seems that the docs website is down?

http://docs.scipy.org/doc/

-Dave
Re: [Numpy-discussion] simple way to denote unchanged dimension in reshape?
Chao YUE chaoyuejoy at gmail.com writes:

Dear all, I have a simple question. Is there a way to denote the unchanged dimension in the reshape function? Like suppose I have an array named arr having three dims with the first dimension length as 48, I want to reshape the first dim into 12*4, but keeping all the other dimension lengths unchanged. Like when we slice the array, we can use arr[10:40, ...], where '...' represents all remaining dimensions. However when doing reshape, we must use: arr.reshape(12, -1, arr.shape[1], arr.shape[2]) Is there something allowing more flexible reshape, like arr.reshape(12, -1, ...)? thanks a lot in advance, best, Chao

For the example given the below code works:

In [1]: x = randn(48,5,4,3,2)

In [2]: x.reshape(12,-1,*x.shape[1:]).shape
Out[2]: (12L, 4L, 5L, 4L, 3L, 2L)

HTH,
Dave
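A small helper along these lines (a sketch only, not an existing numpy function) makes the intent a bit more explicit:

import numpy as np

def reshape_first(arr, *first_dims):
    # reshape only the leading axis, leaving all remaining axes unchanged
    return arr.reshape(first_dims + arr.shape[1:])

x = np.random.randn(48, 5, 4, 3, 2)
assert reshape_first(x, 12, 4).shape == (12, 4, 5, 4, 3, 2)
assert reshape_first(x, 12, -1).shape == (12, 4, 5, 4, 3, 2)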
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Julian Taylor jtaylor.debian at googlemail.com writes: On 16.05.2014 10:59, Dave Hirschfeld wrote: Julian Taylor jtaylor.debian at googlemail.com writes: Yes, I'd heard about the improvements and am very excited to try them out since indexing is one of the bottlenecks in our algorithm. I made a PR with the simple change: https://github.com/numpy/numpy/pull/4721 improves it by the expected 50%, but its still 40% slower than the improved normal indexing. Having some problems building numpy to test this out, but assuming it does what it says on the tin I'd be very keen to get this in the impending 1.9 release if possible. Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Sebastian Berg sebastian at sipsolutions.net writes:

On Do, 2014-05-15 at 12:31 +0000, Dave Hirschfeld wrote: As can be seen from the code below (or in the notebook linked beneath) fancy indexing of a structured array is twice as slow as indexing both fields independently - making it 4x slower?

The non-vanilla types tend to be somewhat more efficient with these things and the first indexing does not copy so it is rather fast. I did not check the code, but we use (also in the new one for this operation) the copyswap function on individual elements (only for non-trivial copies in 1.9 and later, making the difference even larger), and this is probably not specialized to the specific void type so it probably has to call the copyswap for every field (and first get the fields). All that work would be done for every element. If you are interested in this, you could check the fancy indexing inner loop and see if replacing the copyswap with the specialized strided transfer functions (it is used further down in a different branch of the loop) actually makes things faster. I would expect so for some void types anyway, but not sure in general.

- Sebastian

Thanks for the explanation and pointers - it sounds like a good opportunity for getting stuck into the internals of numpy which I've been meaning to do. I'm not sure I've got the required skills but I'm sure it will be a good learning experience. Unfortunately it won't likely be in the immediate future that I'll have the time to do so. In the meantime I can live with indexing the fields independently.

Thanks,
Dave
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Julian Taylor jtaylor.debian at googlemail.com writes:

if ~50% faster is fast enough a simple improvement would be to replace the use of PyArg_ParseTuple with manual tuple unpacking. The PyArg functions are incredibly slow and is not required in VOID_copyswap which just extracts 'Oi'. This 50% increase still makes it slower than the simpler indexing variant as these have been greatly improved in 1.9 (thanks to Sebastian for this :) )

Yes, I'd heard about the improvements and am very excited to try them out since indexing is one of the bottlenecks in our algorithm.

-Dave
[Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
As can be seen from the code below (or in the notebook linked beneath) fancy indexing of a structured array is twice as slow as indexing both fields independently - making it 4x slower?

I found that fancy indexing was a bottleneck in my application so I was hoping to reduce the overhead by combining the arrays into a structured array and only doing one indexing operation. Unfortunately that doubled the time that it took! Is there any reason for this? If not, I'm happy to open an enhancement issue on GitHub - just let me know.

Thanks,
Dave

In [32]: nrows, ncols = 365, 1

In [33]: items = np.rec.fromarrays(randn(2, nrows, ncols), names=['widgets','gadgets'])

In [34]: row_idx = randint(0, nrows, ncols)
    ...: col_idx = np.arange(ncols)

In [35]: %timeit filtered_items = items[row_idx, col_idx]
100 loops, best of 3: 3.45 ms per loop

In [36]: %%timeit
    ...: widgets = items['widgets'][row_idx, col_idx]
    ...: gadgets = items['gadgets'][row_idx, col_idx]
    ...:
1000 loops, best of 3: 1.57 ms per loop

http://nbviewer.ipython.org/urls/gist.githubusercontent.com/dhirschfeld/98b9970fb68adf23dfea/raw/10c0f968ea1489f0a24da80d3af30de7106848ac/Slow%20Structured%20Array%20Indexing.ipynb

https://gist.github.com/dhirschfeld/98b9970fb68adf23dfea
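For reference, the field-by-field workaround can be wrapped up so calling code still gets a single structured result back - a sketch only, assuming broadcastable 1-D index arrays (the sizes below are hypothetical):

import numpy as np

def take_fields(arr, row_idx, col_idx):
    # fancy-index each field independently (the fast path) and reassemble
    # the results into a structured array with the same dtype
    out = np.empty(np.broadcast(row_idx, col_idx).shape, dtype=arr.dtype)
    for name in arr.dtype.names:
        out[name] = arr[name][row_idx, col_idx]
    return out

nrows, ncols = 365, 1000   # hypothetical sizes
items = np.rec.fromarrays(np.random.randn(2, nrows, ncols), names=['widgets', 'gadgets'])
row_idx = np.random.randint(0, nrows, ncols)
col_idx = np.arange(ncols)
filtered = take_fields(items, row_idx, col_idx)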
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Sankarshan Mudkavi smudkavi at uwaterloo.ca writes:

Hey all, It's been a while since the last datetime and timezones discussion thread was visited (linked below): http://thread.gmane.org/gmane.comp.python.numeric.general/53805 It looks like the best approach to follow is the UTC only approach in the linked thread with an optional flag to indicate the timezone (to avoid confusing applications where they don't expect any timezone info). Since this is slightly more useful than having just a naive datetime64 package and would be open to extension if required, it's probably the best way to start improving the datetime64 library. <snip> I would like to start writing a NEP for this followed by implementation, however I'm not sure what the format etc. is, could someone direct me to a page where this information is provided? Please let me know if there are any ideas, comments etc. Cheers, Sankarshan

See: http://article.gmane.org/gmane.comp.python.numeric.general/55191

You could use a current NEP as a template: https://github.com/numpy/numpy/tree/master/doc/neps

I'm a huge +100 on the simplest UTC fix. As is, using numpy datetimes is likely to silently give incorrect results - something I've already seen several times in end-user data analysis code.

Concrete Example:

In [16]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]
    ...: values = np.array([1,2,3]).repeat(24)
    ...: records = zip(map(str, dates), values)
    ...: pd.TimeSeries(values, dates).groupby(lambda d: d.date()).mean()
    ...:
Out[16]:
2014-04-01    1
2014-04-02    2
2014-04-03    3
dtype: int32

In [17]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'), ('values', float)]))
    ...: df.set_index('dates', inplace=True)
    ...: df.groupby(lambda d: d.date()).mean()
    ...:
Out[17]:
              values
2014-03-31  1.000000
2014-04-01  1.041667
2014-04-02  2.041667
2014-04-03  3.000000

[4 rows x 1 columns]

Try it in your timezone and see what you get!

-Dave
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Jeff Reback jeffreback at gmail.com writes:

Dave, your example is not a problem with numpy per se, rather that the default generation is in local timezone (same as what python datetime does). If you localize to UTC you get the results that you expect.

The problem is that the default datetime generation in *numpy* is in local time. Note that this *is not* the case in Python - it doesn't try to guess the timezone info based on where in the world you run the code, if it's not provided it sets it to None.

In [7]: pd.datetime?
Type:        type
String Form: <type 'datetime.datetime'>
Docstring:
datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

The year, month and day arguments are required. tzinfo may be None, or an instance of a tzinfo subclass. The remaining arguments may be ints or longs.

In [8]: pd.datetime(2000,1,1).tzinfo is None
Out[8]: True

This may be the best solution but as others have pointed out this is more difficult to implement and may have other issues. I don't want to wait for the best solution - the assume UTC on input/output if not specified will solve the problem and this desperately needs to be fixed because it's completely broken as is IMHO.

"If you localize to UTC you get the results that you expect."

That's the whole point - *numpy* needs to localize to UTC, not to whatever timezone you happen to be in when running the code. In a real-world data analysis problem you don't start with the data in a DataFrame or a numpy array - it comes from the web, a csv, Excel, a database and you want to convert it to a DataFrame or numpy array. So what you have from whatever source is a list of tuples of strings and you want to convert them into a typed array. Obviously you can't localize a string - you have to convert it to a date first and if you do that with numpy the date you have is wrong.

In [108]: dst = np.array(['2014-03-30 00:00', '2014-03-30 01:00', '2014-03-30 02:00'], dtype='M8[h]')
     ...: dst
     ...:
Out[108]: array(['2014-03-30T00+0000', '2014-03-30T00+0000', '2014-03-30T02+0100'], dtype='datetime64[h]')

In [109]: dst.tolist()
Out[109]:
[datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 1, 0)]

AFAICS there's no way to get the original dates back once they've passed through numpy's parser!?

-Dave
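One way to avoid the local-time interpretation at parse time, on numpy versions from this era that still accept explicit timezone designators, is to make the offset explicit in the strings before conversion - a sketch of the idea, not a recommendation:

import numpy as np

strings = ['2014-03-30 00:00', '2014-03-30 01:00', '2014-03-30 02:00']

# append an explicit UTC designator so the parser can't apply the local offset
utc = np.array([s.replace(' ', 'T') + 'Z' for s in strings], dtype='M8[h]')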
Re: [Numpy-discussion] svd error checking vs. speed
alex argriffi at ncsu.edu writes:

Hello list, Here's another idea resurrection from numpy github comments that I've been advised could be posted here for re-discussion. The proposal would be to make np.linalg.svd more like scipy.linalg.svd with respect to input checking. The argument against the change is raw speed; if you know that you will never feed non-finite input to svd, then np.linalg.svd is a bit faster than scipy.linalg.svd. An argument for the change could be to avoid issues reported on github like crashes, hangs, spurious non-convergence exceptions, etc. from the undefined behavior of svd of non-finite input. [...] the following numpy code hangs until I `kill -9` it.

```
$ python runtests.py --shell
$ python
Python 2.7.5+ [GCC 4.8.1] on linux2
>>> import numpy as np
>>> np.__version__
'1.9.0.dev-e3f0f53'
>>> A = np.array([[1e3, 0], [0, 1]])
>>> B = np.array([[1e300, 0], [0, 1]])
>>> C = np.array([[1e3000, 0], [0, 1]])
>>> np.linalg.svd(A)
(array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]]))
>>> np.linalg.svd(B)
(array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]]))
>>> np.linalg.svd(C)
[hangs forever]
```

Alex

I'm -1 on checking finiteness - if there's one place you usually want maximum performance it's linear algebra operations. It certainly shouldn't crash or hang though and for me at least it doesn't - it returns NaN which immediately suggests to me that I've got bad input (maybe just because I've seen it before). I'm not sure adding an extra kwarg is worth cluttering up the api when a simple call to isfinite beforehand will do the job if you think you may potentially have non-finite input.

Python 2.7.5 |Anaconda 1.8.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

In [1]: import numpy as np

In [2]: A = np.array([[1e3, 0], [0, 1]])
   ...: B = np.array([[1e300, 0], [0, 1]])
   ...: C = np.array([[1e3000, 0], [0, 1]])
   ...: np.linalg.svd(A)
   ...:
Out[2]: (array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]]))

In [3]: np.linalg.svd(B)
Out[3]: (array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]]))

In [4]: C
Out[4]: array([[ inf, 0.], [ 0., 1.]])

In [5]: np.linalg.svd(C)
Out[5]: (array([[ 0., 1.], [ 1., 0.]]), array([ nan, nan]), array([[ 0., 1.], [ 1., 0.]]))

In [6]: np.__version__
Out[6]: '1.7.1'

Regards,
Dave
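The isfinite-up-front approach is a one-liner in user code; a minimal sketch of what such a wrapper might look like (not an existing numpy function):

import numpy as np

def checked_svd(a, **kwargs):
    # pay for the finiteness check only when you actually want it,
    # keeping np.linalg.svd itself free of the overhead
    if not np.isfinite(a).all():
        raise ValueError("array must not contain infs or NaNs")
    return np.linalg.svd(a, **kwargs)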
Re: [Numpy-discussion] svd error checking vs. speed
Sturla Molden sturla.molden at gmail.com writes:

josef.pktd at gmail.com wrote: I use official numpy release for development, Windows, 32bit python, i.e. MingW 3.5 and whatever old ATLAS the release includes. A constant 13% cpu usage is 1/8th of my 8 virtual cores.

Based on this and Alex' message it seems the offender is the f2c generated lapack_lite library. So what do we do with lapack_lite? Should we patch it? Sturla

Even if lapack_lite always performed the isfinite check and threw a python error if False, it would be much better than either hanging or segfaulting and people who care about the isfinite cost probably would be linking to a fast lapack anyway.

-Dave
Re: [Numpy-discussion] Numpy 1.9 release date
Ralf Gommers ralf.gommers at gmail.com writes: On Fri, Nov 8, 2013 at 8:22 PM, Charles R Harris charlesr.harris at gmail.com wrote: and think that the main thing missing at this point is fixing the datetime problems. Is anyone planning to work on this? If yes, you need a rough estimate of when this is ready to go. If no, it needs to be decided if this is critical for the release. From the previous discussion I tend to think so. If it's critical but no one does it, why plan a release... Ralf Just want to pipe up here as to the criticality of datetime bug. Below is a minimal example from some data analysis code I found in our company that was giving incorrect results (fortunately it was caught by thorough testing): In [110]: records = [ ...: ('2014-03-29 23:00:00', '2014-03-29 23:00:00'), ...: ('2014-03-30 00:00:00', '2014-03-30 00:00:00'), ...: ('2014-03-30 01:00:00', '2014-03-30 01:00:00'), ...: ('2014-03-30 02:00:00', '2014-03-30 02:00:00'), ...: ('2014-03-30 03:00:00', '2014-03-30 03:00:00'), ...: ('2014-10-25 23:00:00', '2014-10-25 23:00:00'), ...: ('2014-10-26 00:00:00', '2014-10-26 00:00:00'), ...: ('2014-10-26 01:00:00', '2014-10-26 01:00:00'), ...: ('2014-10-26 02:00:00', '2014-10-26 02:00:00'), ...: ('2014-10-26 03:00:00', '2014-10-26 03:00:00')] ...: ...: ...: data = np.asarray(records, dtype=[('date obj', 'M8[h]'), ('str repr', object)]) ...: df = pd.DataFrame(data) In [111]: df Out[111]: date obj str repr 0 2014-03-29 23:00:00 2014-03-29 23:00:00 1 2014-03-30 00:00:00 2014-03-30 00:00:00 2 2014-03-30 00:00:00 2014-03-30 01:00:00 3 2014-03-30 01:00:00 2014-03-30 02:00:00 4 2014-03-30 02:00:00 2014-03-30 03:00:00 5 2014-10-25 22:00:00 2014-10-25 23:00:00 6 2014-10-25 23:00:00 2014-10-26 00:00:00 7 2014-10-26 01:00:00 2014-10-26 01:00:00 8 2014-10-26 02:00:00 2014-10-26 02:00:00 9 2014-10-26 03:00:00 2014-10-26 03:00:00 Note the local timezone adjusted `date obj` including the duplicate value at the clock-change in March and the missing value at the clock-change in October. As you can imagine this could very easily lead to incorrect analysis. If running this exact same code in the (Eastern) US you'd see the following results: date obj str repr 0 2014-03-30 03:00:00 2014-03-29 23:00:00 1 2014-03-30 04:00:00 2014-03-30 00:00:00 2 2014-03-30 05:00:00 2014-03-30 01:00:00 3 2014-03-30 06:00:00 2014-03-30 02:00:00 4 2014-03-30 07:00:00 2014-03-30 03:00:00 5 2014-10-26 03:00:00 2014-10-25 23:00:00 6 2014-10-26 04:00:00 2014-10-26 00:00:00 7 2014-10-26 05:00:00 2014-10-26 01:00:00 8 2014-10-26 06:00:00 2014-10-26 02:00:00 9 2014-10-26 07:00:00 2014-10-26 03:00:00 Unfortunately I don't have the skills to meaningfully contribute in this area but it is a very real problem for users of numpy, many of whom are not active on the mailing list. HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
josef.pktd at gmail.com writes: I think a H is feature creep and too specialized What's .H of a int a str a bool ? It's just .T and a view, so you cannot rely that conj() makes a copy if you don't work with complex. .T is just a reshape function and has **nothing** to do with matrix algebra. It seems to me that that ship has already sailed - i.e. conj doesn't make much sense for str arrays, but it still works in the sense that it's a nop In [16]: A = asarray(list('abcdefghi')).reshape(3,3) ...: np.all(A.T == A.conj().T) ...: Out[16]: True If we're voting my vote goes to add the .H attribute for all the reasons Alan has specified. Document that it returns a copy but that it may in future return a view so it it not future proof to operate on the result inplace. I'm -1 on .H() as it will require code changes if it ever changes to a property and it will simply result in questions about why .T is a property and .H is a function (and why it's a property for (sparse) matrices) Regarding Dag's example: xh = x.H x *= 2 assert np.all(2 * xh == x.H) I'm sceptical that there's much code out there actually relying on the fact that a transpose is a view with the specified intention of altering the original array inplace. I work with a lot of beginners and whenever I've seen them operate inplace on a transpose it has been a bug in the code, leading to a discussion of how, for performance reasons, numpy will return a view where possible, leading to yet further discussion of when it is and isn't possible to return a view. The third option of .H returning a view would probably be agreeable to everyone but I don't think we should punt on this decision for something that if it does happen is likely years away. It seems that work on this front is happening in different projects to numpy. Even if for example sometime in the future numpy's internals were replaced with libdynd or other expression graph engine surely this would result in more breaking changes than .H returning a view rather than a copy?! IANAD so I'm happy with whatever the consensus is I just thought I'd put forward the view from a (specific type of) user perspective. Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
Nathaniel Smith njs at pobox.com writes: As soon as you talk about attributes returning things you've already broken Python's mental model... attributes are things that sit there, not things that execute arbitrary code. Of course this is not how the actual implementation works, attribute access *can* in fact execute arbitrary code, but the mental model is important, so we should preserve it where-ever we can. Just mentioning an attribute should not cause unbounded memory allocations. Yep, sorry - sloppy use of terminology which I agree is important in helping understand what's happening. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
Alan G Isaac alan.isaac at gmail.com writes: On 7/22/2013 3:10 PM, Nathaniel Smith wrote: Having .T but not .H is an example of this split. Hate to do this but ... Readability counts. +10! A.conjugate().transpose() is unspeakably horrible IMHO. Since there's no way to avoid a copy you gain nothing by not providing the convenience function. It should be fairly obvious that an operation which changes the values of an array (and doesn't work in-place) necessarily takes a copy. I think it's more than sufficient to simply document the fact that A.H will return a copy. A user coming from Matlab probably doesn't care that it takes a copy but you'd be hard pressed to convince them there's any benefit of writing A.conjugate().transpose() over exactly what it looks like in textbooks - A.H Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
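In the meantime the textbook spelling is easy enough to get in user code; a throwaway helper along these lines (a sketch, not part of numpy) keeps call sites readable while the .H discussion is unresolved:

import numpy as np

def H(a):
    # conjugate transpose; note this necessarily returns a copy for complex input
    return np.conjugate(a).T

A = np.array([[1 + 2j, 3], [4j, 5]])
assert np.allclose(H(H(A)), A)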
[Numpy-discussion] datetime64 constructor ignores dtype argument?
The example below demonstrates the fact that the datetime64 constructor ignores the dtype argument if passed in. Is this conscious design decision or a bug/oversight? In [55]: from datetime import datetime ...: d = datetime.now() ...: In [56]: d Out[56]: datetime.datetime(2013, 6, 12, 11, 3, 27, 861000) In [57]: np.datetime64(d) Out[57]: numpy.datetime64('2013-06-12T12:03:27.861000+0100') In [58]: np.datetime64(d).dtype Out[58]: dtype('M8[us]') In [59]: np.datetime64(d, dtype='M8[D]') Out[59]: numpy.datetime64('2013-06-12T12:03:27.861000+0100') In [60]: np.datetime64(d, dtype='M8[D]').dtype Out[60]: dtype('M8[us]')??? In [61]: np.datetime64(d).astype('M8[D]') Out[61]: numpy.datetime64('2013-06-12') Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
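A possible workaround: the unit can be given as the second positional argument to the constructor, which does appear to be honoured (unlike the dtype= keyword above) - worth verifying on the numpy version in question:

import numpy as np
from datetime import datetime

d = datetime(2013, 6, 12, 11, 3, 27, 861000)

day = np.datetime64(d, 'D')   # expected to truncate to day precision
assert day.dtype == np.dtype('M8[D]')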
Re: [Numpy-discussion] 1.8 release
Charles R Harris charlesr.harris at gmail.com writes: Hi All,I think it is time to start the runup to the 1.8 release. I don't know of any outstanding blockers but if anyone has a PR/issue that they feel needs to be in the next Numpy release now is the time to make it known.Chuck It would be good to get the utc-everywhere fix for datetime64 in there if someone has time to look into it. Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] timezones and datetime64
Andreas Hilboll lists at hilboll.de writes:

I think your point about using current timezone in interpreting user input being dangerous is probably correct --- perhaps UTC all the way would be a safer (and simpler) choice?

+1

+10 from me! I've recently come across a bug due to the fact that numpy interprets dates as being in the local timezone. The data comes from a database query where there is no timezone information supplied (and dates are stored as strings). It is assumed that the user doesn't need to know the timezone - i.e. the dates are timezone naive. Working out the correct timezones would be fairly laborious, but whatever the correct timezones are, they're certainly not the timezone the current user happens to find themselves in! e.g.

In [32]: rs = [
    ...:     (u'2000-01-17 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1203),
    ...:     (u'2000-01-26 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1369),
    ...:     (u'2000-01-18 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1122),
    ...:     (u'2000-02-25 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1425)
    ...: ]
    ...: dtype = [('issue_date', 'datetime64[ns]'),
    ...:          ('start_date', 'datetime64[D]'),
    ...:          ('end_date', 'datetime64[D]'),
    ...:          ('value', float)]
    ...: #

In [33]: # What I see in London, UK
    ...: recordset = np.array(rs, dtype=dtype)
    ...: df = pd.DataFrame(recordset)
    ...: df = df.set_index('issue_date')
    ...: df
    ...:
Out[33]:
                     start_date             end_date   value
issue_date
2000-01-17  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-26  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-18  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-25  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

In [34]: # What my colleague sees in Auckland, NZ
    ...: recordset = np.array(rs, dtype=dtype)
    ...: df = pd.DataFrame(recordset)
    ...: df = df.set_index('issue_date')
    ...: df
    ...:
Out[34]:
                              start_date             end_date   value
issue_date
2000-01-16 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-25 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-17 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-24 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

Oh dear! This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects.

Regards,
Dave
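For the record, a sketch of the object-dtype bypass described in the last paragraph (assuming dateutil is available; the row layout follows the example above):

import numpy as np
from dateutil.parser import parse

rs = [(u'2000-01-17 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1203),
      (u'2000-01-26 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1369)]

# keep the issue dates as timezone-naive datetime objects so that no
# local-timezone conversion is ever applied
issue_dates = np.array([parse(row[0]) for row in rs], dtype=object)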
Re: [Numpy-discussion] timezones and datetime64
Nathaniel Smith njs at pobox.com writes:

On Wed, Apr 3, 2013 at 2:26 PM, Dave Hirschfeld dave.hirschfeld at gmail.com wrote: This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects.

Wow, that's truly broken. I'm sorry. I'm skeptical that just switching to UTC everywhere is actually the right solution. It smells like one of those solutions that's simple, neat, and wrong. (I don't know anything about calendar-time series handling, so I have no ability to actually judge this stuff, but wouldn't one problem be if you want to know about business days/hours? You lose the original day-of-year once you move everything to UTC.) Maybe datetime dtypes should be parametrized by both granularity and timezone? Or we could just declare that datetime64 is always timezone-naive and adjust the code to match? I'll CC the pandas list in case they have some insight. Unfortunately AFAIK no-one who's regularly working on numpy at this point works with datetimes, so we have limited ability to judge solutions... please help! -n

I think simply setting the timezone to UTC if it's not specified would solve 99% of use cases because IIUC the internal representation is UTC so numpy would be doing no conversion of the dates that were passed in. It was the conversion which was the source of the error in my example. The only potential issue with this is that the dates might take along an incorrect UTC timezone, making it more difficult to work with naive datetimes. e.g.

In [42]: d = np.datetime64('2014-01-01 00:00:00', dtype='M8[ns]')

In [43]: d
Out[43]: numpy.datetime64('2014-01-01T00:00:00+0000')

In [44]: str(d)
Out[44]: '2014-01-01T00:00:00+0000'

In [45]: pydate(str(d))
Out[45]: datetime.datetime(2014, 1, 1, 0, 0, tzinfo=tzutc())

In [46]: pydate(str(d)) == datetime.datetime(2014, 1, 1)
Traceback (most recent call last):
  File "<ipython-input-46-abfc0fee9b97>", line 1, in <module>
    pydate(str(d)) == datetime.datetime(2014, 1, 1)
TypeError: can't compare offset-naive and offset-aware datetimes

In [47]: pydate(str(d)) == datetime.datetime(2014, 1, 1, tzinfo=tzutc())
Out[47]: True

In [48]: pydate(str(d)).replace(tzinfo=None) == datetime.datetime(2014, 1, 1)
Out[48]: True

In this case it may be best to have numpy not try to set the timezone at all if none was specified. Given the internal representation is UTC I'm not sure this is feasible though so defaulting to UTC may be the best solution.

Regards,
Dave
Re: [Numpy-discussion] New numpy functions: filled, filled_like
Robert Kern robert.kern at gmail.com writes:

One alternative that does not expand the API with two-liners is to let the ndarray.fill() method return self: a = np.empty(...).fill(20.0)

This violates the convention that in-place operations never return self, to avoid confusion with out-of-place operations. E.g. ndarray.resize() versus ndarray.reshape(), ndarray.sort() versus np.sort(), and in the broader Python world, list.sort() versus sorted(), list.reverse() versus reversed(). (This was an explicit reason given for list.sort to not return self, even.) Maybe enabling this idiom is a good enough reason to break the convention ("Special cases aren't special enough to break the rules." / "Although practicality beats purity"), but it at least makes me -0 on this...

I tend to agree with the notion that inplace operations shouldn't return self, but I don't know if it's just because I've been conditioned this way. Not returning self breaks the fluid interface pattern [1], as noted in a similar discussion on pandas [2], FWIW, though there's likely some way to have both worlds.

Ah-hah, here's the email where Guido officially proclaims that there shall be no fluent interface nonsense applied to in-place operators in Python, because it hurts readability (at least for Dutch people): http://mail.python.org/pipermail/python-dev/2003-October/038855.html

That's a statement about the policy for the stdlib, and just one person's opinion. You, and numpy, are permitted to have a different opinion. In any case, I'm not strongly advocating for it. Its violation of principle (no fluent interfaces) is roughly in the same ballpark as np.filled() (not every two-liner needs its own function), so I thought I would toss it out there for consideration.

-- Robert Kern

FWIW I'm +1 on the idea. Perhaps because I just don't see many practical downsides to breaking the convention but I regularly see a big issue with there being no way to instantiate an array with a particular value. The one obvious way to do it is use ones and multiply by the value you want. I work with a lot of inexperienced programmers and I see this idiom all the time. It takes a fair amount of numpy knowledge to know that you should do it in two lines by using empty and setting a slice.

In [1]: %timeit NaN*ones(1)
1000 loops, best of 3: 1.74 ms per loop

In [2]: %%timeit
   ...: x = empty(1, dtype=float)
   ...: x[:] = NaN
   ...:
1 loops, best of 3: 28 us per loop

In [3]: 1.74e-3/28e-6
Out[3]: 62.142857142857146

Even when not in the mythical tight loop setting an array to one and then multiplying uses up a lot of cycles - it's nearly 2 orders of magnitude slower than what we know they *should* be doing. I'm agnostic as to whether fill should be modified or new functions provided but I think numpy is currently missing this functionality and that providing it would save a lot of new users from shooting themselves in the foot performance-wise.

-Dave
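For anyone wanting this today, the two-line idiom wraps up into a trivial helper - a sketch of what a filled() function could look like in user code, not the actual proposal:

import numpy as np

def filled(shape, fill_value, dtype=float):
    # allocate uninitialised memory and fill it in place, avoiding the
    # temporary that `fill_value * np.ones(shape)` creates
    a = np.empty(shape, dtype=dtype)
    a.fill(fill_value)
    return a

nan_block = filled((1000, 1000), np.nan)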
Re: [Numpy-discussion] Bug in as_strided/reshape
Sebastian Berg sebastian at sipsolutions.net writes: Hello, looking at the code, when only adding/removing dimensions with size 1, numpy takes a small shortcut, however it uses 0 stride lengths as value for the new one element dimensions temporarily, then replacing it again to ensure the new array is contiguous. This replacing does not check if the dimension has more then size 1. Likely there is a better way to fix it, but the attached diff should do it. Regards, Sebastian Thanks for the confirmation. So this doesn't get lost I've opened issue #380 on GitHub https://github.com/numpy/numpy/issues/380 -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Second try: possible bug in assignment to complex array
Mark Bakker markbak at gmail.com writes: I think there is a problem with assigning a 1D complex array of length one to a position in another complex array. Example: a = ones(1,'D') b = ones(1,'D') a[0] = b --- TypeError Traceback (most recent call last) ipython-input-37-0c4fc6d780e3 in module() 1 a[0] = b TypeError: can't convert complex to float This works correctly when a and b are real arrays: a = ones(1) b = ones(1) a[0] = b Bug or feature? Thanks, Mark I can't help unfortunately, but I can confirm that I also see the problem on Win32 Python 2.7.3, numpy 1.6.2. As a workaround it appears that slicing works: In [15]: sys.version Out[15]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]' In [16]: sys.version Out[16]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]' In [17]: np.__version__ Out[17]: '1.6.2' In [18]: a = ones(1,'D') In [19]: b = 2*ones(1,'D') In [20]: a[0] = b --- TypeError Traceback (most recent call last) ipython-input-20-0c4fc6d780e3 in module() 1 a[0] = b TypeError: can't convert complex to float In [21]: a[0:1] = b In [22]: -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Bug in as_strided/reshape
Dave Hirschfeld dave.hirschfeld at gmail.com writes: It seems that reshape doesn't work correctly on an array which has been resized using the 0-stride trick e.g. In [73]: x = array([5]) In [74]: y = as_strided(x, shape=(10,), strides=(0,)) In [75]: y Out[75]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]) In [76]: y.reshape([10,1]) Out[76]: array([[ 5], [ 8], [ 762933412], [-2013265919], [ 26], [ 64], [ 762933414], [-2013244356], [ 26], [ 64]]) Should all be 5 In [77]: y.copy().reshape([10,1]) Out[77]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [78]: np.__version__ Out[78]: '1.6.2' Perhaps a clause such as below is required in reshape? if any(stride == 0 for stride in y.strides): return y.copy().reshape(shape) else: return y.reshape(shape) Regards, Dave Though it would be good to avoid the copy which you should be able to do in this case. Investigating further: In [15]: y.strides Out[15]: (0,) In [16]: z = y.reshape([10,1]) In [17]: z.strides Out[17]: (4, 4) In [18]: z.strides = (0, 4) In [19]: z Out[19]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [32]: y.reshape([5, 2]) Out[32]: array([[5, 5], [5, 5], [5, 5], [5, 5], [5, 5]]) In [33]: y.reshape([5, 2]).strides Out[33]: (0, 0) So it seems that reshape is incorrectly setting the stride of axis0 to 4, but only when the appended axis is of size 1. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Bug in as_strided/reshape
It seems that reshape doesn't work correctly on an array which has been resized using the 0-stride trick e.g. In [73]: x = array([5]) In [74]: y = as_strided(x, shape=(10,), strides=(0,)) In [75]: y Out[75]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]) In [76]: y.reshape([10,1]) Out[76]: array([[ 5], [ 8], [ 762933412], [-2013265919], [ 26], [ 64], [ 762933414], [-2013244356], [ 26], [ 64]]) Should all be 5 In [77]: y.copy().reshape([10,1]) Out[77]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [78]: np.__version__ Out[78]: '1.6.2' Perhaps a clause such as below is required in reshape? if any(stride == 0 for stride in y.strides): return y.copy().reshape(shape) else: return y.reshape(shape) Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
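Until the bug is fixed, the clause suggested above can live in a small wrapper in user code - a sketch only:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def safe_reshape(arr, shape):
    # copy first if any stride is zero, since reshape mishandles such views
    if any(stride == 0 for stride in arr.strides):
        return arr.copy().reshape(shape)
    return arr.reshape(shape)

x = np.array([5])
y = as_strided(x, shape=(10,), strides=(0,))
assert (safe_reshape(y, (10, 1)) == 5).all()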
Re: [Numpy-discussion] Moving lib.recfunctions?
Pierre GM pgmdevlist at gmail.com writes:

Hello, The idea behind having a lib.recfunctions and not a rec.recfunctions or whatever was to illustrate that the functions of this package are more generic than they appear. They work with regular structured ndarrays and don't need recarrays. Methinks we gonna lose this aspect if you try to rename it, but hey, your call.

I've never really thought there's much distinction between the two - AFAICT a recarray is just a structured array with attribute access? If a function only accepts a recarray (are there any?) isn't it just a simple call to .view(np.recarray) to get it to work with structured arrays? Because of this view I've always thought functions which worked on either should be grouped together.

As to why they were never really advertised? Because I never received any feedback when I started developing them (developing is a big word here, I just took a lot of code that John D Hunter had developed in matplotlib and make it more consistent). I advertised them once or twice on the list, wrote the basic docstrings, but waited for other people to start using them. Anyhow. So, yes, there might be some weird import to polish. Note that if you decided to just rename the package and leave it where it was, it would probably be easier.

The path of least resistance is to just import lib.recfunctions.* into the (already crowded) main numpy namespace and be done with it.

Why? Why can't you leave it available through numpy.lib? Once again, if it's only a matter of PRing, you could start writing an entry page in the doc describing the functions, that would improve the visibility.

I do recall them being advertised a while ago, but when I came to look for them I couldn't find them - IMHO np.rec is a much more intuitive (and nicer/shorter) namespace than np.lib.recfunctions. I think having similar functionality in two completely different namespaces is confusing and hard to remember. It also doesn't help that np.lib.recfunctions isn't discoverable by tab-completion:

In [2]: np.lib.rec
np.lib.recfromcsv  np.lib.recfromtxt

...of course you could probably find it with np.lookfor but it's one more barrier to their use. FWIW I'd be happy if the np.lib.recfunctions functions were made available in the np.rec namespace (and possibly deprecate np.lib.recfunctions to avoid confusion?) I'm conscious that as a user (not a developer) talk is cheap and I'm happy with whatever the consensus is. I just thought I'd pipe up since it was only through this thread that I re-discovered np.lib.recfunctions!

HTH,
Dave
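The structured-array/recarray equivalence mentioned above really is just a view - a quick illustration (the field names are made up):

import numpy as np

sa = np.zeros(3, dtype=[('widgets', float), ('gadgets', float)])

ra = sa.view(np.recarray)   # same memory, attribute access added
ra.widgets[:] = 1.0
assert (sa['widgets'] == 1.0).all()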
Re: [Numpy-discussion] fixing up datetime
Wes McKinney wesmckinn at gmail.com writes: - Fundamental need to be able to work with multiple time series, especially performing operations involving cross-sectional data - I think it's a bit hard for lay people to use (read: ex-MATLAB/R users). This is just my opinion, but a few years ago I thought about using it and concluded that teaching people how to properly use it (a precision tool, indeed!) was going to cause me grief. - The data alignment problem, best explained in code: - Inability to derive custom offsets: I can do: In [14]: ts.shift(2, offset=2 * datetools.BDay()) Out[14]: 2000-01-11 00:00:00 0.0503706684002 2000-01-18 00:00:00 -1.7660004939 2000-01-25 00:00:00 1.11716758554 2000-02-01 00:00:00 -0.171029995265 2000-02-08 00:00:00 -0.99876580126 2000-02-15 00:00:00 -0.262729046405 or even generate, say, 5-minutely or 10-minutely date ranges thusly: In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', offset=datetools.Minute(5)) Out[16]: class 'pandas.core.daterange.DateRange' offset: 5 Minutes, tzinfo: None [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] length: 85 It would be nice to have a step argument in the date_array function. The following works though: In [96]: integers = r_[ts.Date('T',01-Aug-2011 05:00).value: ts.Date('T',06-Aug-2011 12:01).value: 5] In [97]: ts.date_array(integers, freq='T') Out[97]: DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ..., 06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ..., 06-Aug-2011 12:00], freq='T') - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the SP 500 index: In [7]: spx.index Out[7]: class 'pandas.core.daterange.DateRange' offset: 1 BusinessDay, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885 2008-10-09 00:00:00 -0.0761670761671 2008-11-20 00:00:00 -0.0671229140321 - W Seems like you're looking for the tssort function: In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D') In [91]: def tssort(series): : indices = series.argsort() : return ts.time_series(series._series[indices], : series.dates[indices], : autosort=False) : In [92]: tssort(series)[0:5] Out[92]: timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ], dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011], freq = D) Pandas seems to offer an even higher level api than scikits.timeseries. I view them as mostly complementary but I haven't (yet) had much experience with pandas... -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
Mark Wiebe mwwiebe at gmail.com writes: It appears to me that a structured dtype with some further NumPy extensions could entirely replace the 'events' metadata fairly cleanly. If the ufuncs are extended to operate on structured arrays, and integers modulo n are added as a new dtype, a dtype like [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current 'M8[D]//100'. Sounds like a cleaner API. As Dave H. summarized, we used a basic keyword to do the same thing in scikits.timeseries, with the addition of some subfrequencies like A-SEP to represent a year starting in September, for example. It works, but it's really not elegant a solution. This kind of thing definitely belongs in a layer above datetime. That's fair enough - my perspective as a timeseries user is probably a lot higher level. My main aim is to point out some of the higher level uses so that the numpy dtype can be made compatible with them - I'd hate to have a situation where we have multiple different datetime representations in the different libraries and having to continually convert at the boundaries. That said, I think the starting point for a series at a weekly, quarterly or annual frequency/unit/type is something which may need to be sorted out at the lowest level... One overall impression I have about timeseries in general is the use of the term frequency synonymously with the time unit. To me, a frequency is a numerical quantity with a unit of 1/(time unit), so while it's related to the time unit, naming it the same is something the specific timeseries domain has chosen to do, I think the numpy datetime class shouldn't have anything called frequency in it, and I would like to remove the current usage of that terminology from the codebase. It seems that it's just a naming convention (possibly not the best) and can be used synonymously with the time unit/resolution/dtype I don't envision 'asfreq' being a datetime function, this is the kind of thing that would layer on top in a specialized timeseries library. The behavior of timedelta follows a more physics-like idea with regard to the time unit, and I don't think something more complicated belongs at the bottom layer that is shared among all datetime uses. I think since freq == dtype then asfreq == astype. From your examples it seems to do the same thing - i.e. if you go to a lower resolution (freq) representation the higher resolution information is truncated - e.g. a monthly resolution date has no information about days/hours/minutes/seconds etc. It's converting in the other direction: low -- high resolution where the difference lies - numpy always converts to the start of the interval whereas the timeseries Date class gives you the option of the start or the end. I'm thinking of a datetime as an infinitesimal moment in time, with the unit representing the precision that can be stored. Thus, '2011', '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in time, but with a different unit of precision. Computationally, this perspective is turning out to provide a pretty rich mechanism to do operations on date. I think this is where the main point of difference is. I use the timeseries as a container for data which arrives in a certain interval of time. e.g. monthly temperatures are the average temperature over the interval defined by the particular month. It's not the temperature at the instant of time that the month began, or ended or of any particular instant in that month. 
Thus the conversion from a monthly resolution to say a daily resolution isn't well defined and the right thing to do is likely to be application specific. For the average temperature example you may want to choose a value in the middle of the month so that you don't introduce a phase delay if you interpolate between the datapoints. If for example you measured total rainfall in a month you might want to choose the last day of the month to represent the total rainfall for that month as that was the only date in the interval where the total rainfall did in fact equal the monthly value. It may be as you say though that all this functionality does belong at a higher level... Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
As a user of numpy/scipy in finance I thought I would put in my 2p worth as it's something which is of great importance in this area. I'm currently a heavy user of the scikits.timeseries package by Matt Pierre and I'm also following the development of statsmodels and pandas should we require more sophisticated statistics in future. Hopefully the numpy datetime type will provide a foundation such packages can build upon... I'll use the timeseries package for reference since I'm most familiar with it and it's a very good api for my requirements. Apologies to Matt/Pierre if I get anything wrong - feel free to correct my misconceptions... I think some of the complexity is coming from the definition of the timedelta. In the timeseries package each date simply represents the number of periods since the epoch and the difference between dates is therefore just and integer with no attached metadata - its meaning is determined by the context it's used in. e.g. In [56]: M1 = ts.Date('M',01-Jan-2011) In [57]: M2 = ts.Date('M',01-Jan-2012) In [58]: M2 - M1 Out[58]: 12 timeseries gets on just fine without a timedelta type - a timedelta is just an integer and if you add an integer to a date it's interpreted as the number of periods of that dates frequency. From a useability point of view M1 + 1 is much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1). Something like the dateutil relativedelta pacage is very convenient and could serve as a template for such functionality: In [59]: from dateutil.relativedelta import relativedelta In [60]: (D1 + 30).datetime Out[60]: datetime.datetime(2011, 1, 31, 0, 0) In [61]: (D1 + 30).datetime + relativedelta(months=1) Out[61]: datetime.datetime(2011, 2, 28, 0, 0) ...but you can still get the same behaviour without a timedelta by asking that the user explicitly specify what they mean by adding one month to a date of a different frequency. e.g. In [62]: (D1 + 30) Out[62]: D : 31-Jan-2011 In [63]: _62.asfreq('M') + 1 Out[63]: M : Feb-2011 In [64]: (_62.asfreq('M') + 1).asfreq('D','END') Out[64]: D : 28-Feb-2011 In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day Out[65]: D : 04-Mar-2011 As Pierre noted when converting dates from a lower frequency to a higher one it's very useful (essential!) to be able to specify whether you want the end or the start of the interval. 
It may also be useful to be able to specify an arbitrary offset from either the start or the end of the interval so you could do something like: In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0) Out[66]: D : 01-Feb-2011 In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1) Out[67]: D : 28-Feb-2011 In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15) Out[68]: D : 16-Feb-2011 I don't think it's useful to define higher 'frequencies' as arbitrary multiples of lower 'frequencies' unless the conversion is exact otherwise it leads to the following inconsistencies: In [69]: days_per_month = 30 In [70]: D1 = M1.asfreq('D',relation='START') In [71]: D2 = M2.asfreq('D','START') In [72]: D1, D2 Out[72]: (D : 01-Jan-2011, D : 01-Jan-2012) In [73]: D1 + days_per_month*(M2 - M1) Out[73]: D : 27-Dec-2011 In [74]: D1 + days_per_month*(M2 - M1) == D2 Out[74]: False If I want the number of days between M1 and M2 I explicitely do the conversion myself: In [75]: M2.asfreq('D','START') - M1.asfreq('D','START') Out[75]: 365 thus avoiding any inconsistency: In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2 Out[76]: True I'm not convinced about the events concept - it seems to add complexity for something which could be accomplished better in other ways. A [Y]//4 dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There may well be a good reason for it however I can't see the need for it in my own applications. In the timeseries package, because the difference between dates represents the number of periods between the dates they must be of the same frequency to unambiguopusly define what a period means: In [77]: M1 - D1 --- ValueErrorTraceback (most recent call last) C:\dev\code\ipython console in module() ValueError: Cannot subtract Date objects with different frequencies. I would argue that in the spirit that we're all consenting adults adding dates of the same frequency can be a useful thing for example in finding the mid-point between two dates: In [78]: M1.asfreq('S','START') Out[78]: S : 01-Jan-2011 00:00:00 In [79]: M2.asfreq('S','START') Out[79]: S : 01-Jan-2012 00:00:00 In [80]: ts.Date('S', (_64.value + _65.value)//2) Out[80]: S : 02-Jul-2011 12:00:00 I think any errors which arose from adding or multiplying dates would be pretty easy to spot in your code. As Robert mentioned the economic data we use is often supplied as weekly, monthly, quarterly or annual data. So these frequencies are critical if we're to use the
Re: [Numpy-discussion] fixing up datetime
Robert Kern robert.kern at gmail.com writes: On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld dave.hirschfeld at gmail.com wrote: I'm not convinced about the events concept - it seems to add complexity for something which could be accomplished better in other ways. A [Y]//4 dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There may well be a good reason for it however I can't see the need for it in my own applications. Well, [D/100] doesn't represent [864s]. It represents something that happens 100 times a day, but not necessarily at precise regular intervals. For example, suppose that I am representing payments that happen twice a month, say on the 1st and 15th of every month, or the 5th and 20th. I would use [M/2] to represent that. It's not [2W], and it's not [15D]. It's twice a month. The default conversions may seem to imply that [D/100] is equivalent to [864s], but they are not intended to. They are just a starting point for one to write one's own, more specific conversions. Similarly, we have default conversions from low frequencies to high frequencies defaulting to representing the higher precision event at the beginning of the low frequency interval. E.g. for days-seconds, we assume that the day is representing the initial second at midnight of that day. We then use offsets to allow the user to add more information to specify it more precisely. That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occurring at a monthly frequency, but at a higher frequency. In this case the lowest frequency which can capture this data is daily frequency, so it should be stored at daily frequency and if monthly statistics are required the series can be aggregated up after the fact. e.g. In [2]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [3]: dates = dates.asfreq('D','START') In [4]: dates = (dates[:,None] + np.tile(array([4, 19]), [12, 1])).ravel() In [5]: data = 100 + 10*randn(12, 2) In [6]: payments = ts.time_series(data.ravel(), dates) In [7]: payments Out[7]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) In [8]: payments.convert('M', ma.count) Out[8]: timeseries([2 2 2 2 2 2 2 2 2 2 2 2], dates = [Jan-2011 ... Dec-2011], freq = M) In [9]: payments.convert('M', ma.sum) Out[9]: timeseries([205.061556202 193.009420163 191.990027161 188.591898598 208.136579315 190.855511303 197.699151161 178.859030538 188.286603379 199.157536259 180.775564724 211.063462017], dates = [Jan-2011 ... Dec-2011], freq = M) Alternatively, for a fixed number of events per period the values can just be stored in a 2D array - e.g.
In [10]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [11]: payments = ts.time_series(data, dates) In [12]: payments Out[12]: timeseries( [[ 103.76588849 101.29566771] [ 91.10363573 101.90578443] [ 102.12588909 89.86413807] [ 94.89200485 93.69989375] [ 103.37375202 104.7628273 ] [ 97.45956699 93.39594431] [ 94.79258639 102.90656477] [ 87.42346985 91.43556069] [ 95.21947628 93.0671271 ] [ 107.07400065 92.0835356 ] [ 94.11035154 86.66521318] [ 109.36556861 101.69789341]], dates = [Jan-2011 ... Dec-2011], freq = M) In [13]: payments.sum(1) Out[13]: timeseries([ 205.0615562 193.00942016 191.99002716 188.5918986 208.13657931 190.8555113 197.69915116 178.85903054 188.28660338 199.15753626 180.77556472 211.06346202], dates = [Jan-2011 ... Dec-2011], freq = M) It seems to me that either of these would satisfy the use-case with the added benefit of simplifying the datetime implementation. That said, I'm not against the proposal if it provides someone with some benefit... Regarding the default conversions, the start of the interval is a perfectly acceptable default (and my preferred choice), however being able to specify the end is also useful since, for the month-day conversion, the end of each month can't be specified by a fixed offset from the start because of their varying lengths. Of course this can be found by subtracting 1 from the start of the next month: (M + 1).asfreq('D',offset=0) - 1 but just as it's easier to write List[-1] rather than List[List.length - 1], it's easier to write M.asfreq('D', offset=-1).
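The same end-of-month idiom could be written with datetime64 in much the same way (a sketch only, assuming numpy 1.7+ where the month-to-day conversion gives the start of the interval):

import numpy as np

m = np.datetime64('2011-02', 'M')
(m + 1).astype('M8[D]') - np.timedelta64(1, 'D')   # numpy.datetime64('2011-02-28')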
Re: [Numpy-discussion] fixing up datetime
Christopher Barker Chris.Barker at noaa.gov writes: Dave Hirschfeld wrote: That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occurring at a monthly frequency, but at a higher frequency. And that frequency is 2/month. In this case the lowest frequency which can capture this data is daily frequency so it should be stored at daily frequency I don't think it should, because it isn't 1/15 days, or, indeed, a frequency that can be specified as days. Sure you can specify the 5th and 20th of each month in a given time span in terms of days since an epoch, but you've lost some information there. When you want to do math -- like add a certain number of 1/2 months -- when is the 100th payment due? It seems keeping it in M/2 is the most natural way to deal with that -- then you don't need special code to do that addition, only when converting to a string (or other format) datetime. With a monthly frequency you can't represent 2/month except using a 2D array as in my second example. This does however discard the information about when in the month the payments occurred/are due. For that matter I'm not sure where the information that the events occur on the 5th and the 20th is recorded in the dtype [M]//2? Maybe my mental model is too fixed in the scikits.timeseries mode and I need to try out the new code... You're right that 2 per month isn't a daily frequency, however as you noted the data can be stored in a daily frequency array with missing values for each date where a payment doesn't occur. The timeseries package uses a masked array to represent the missing data, however you aren't required to have data for every day - only when you call the .fill_missing_dates() function is the array expanded to include a datapoint for every period. As shown below the timeseries is at a daily frequency but only datapoints for the 5th and 20th of the month are included: In [21]: payments Out[21]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) If I want to see when the 5th payment is due I can simply do: In [26]: payments[4:5] Out[26]: timeseries([ 102.12588909], dates = [05-Mar-2011], freq = D) Advancing the payments by a fixed number of days is possible: In [28]: payments.dates[:] += 3 In [29]: payments.dates Out[29]: DateArray([08-Jan-2011, 23-Jan-2011, 08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011], freq='D') Starting 3 payments in the future is more difficult and would require the date array to be recreated with the new starting date.
One way of doing that would be: In [42]: dates = ts.date_array(payments.dates[2], length=(31*payments.size)//2) In [43]: dates = dates[(dates.day == 8) | (dates.day == 23)][0:payments.size] In [44]: dates Out[44]: DateArray([08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011, 08-Jan-2012, 23-Jan-2012], freq='D') In [45]: dates.shape Out[45]: (24,) A bit messy - I'll have to look at the numpy implementation to see how it improves the situation... Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
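For comparison, the same "shift by three payments" could be expressed with a plain datetime64 array by regenerating the 8th/23rd grid and slicing (a sketch only - it assumes numpy 1.7+ datetime64 and is not the scikits.timeseries API):

import numpy as np

months = np.arange(np.datetime64('2011-01'), np.datetime64('2012-02'))    # 13 months, dtype M8[M]
grid = (months.astype('M8[D]')[:, None] + np.array([7, 22])).ravel()      # the 8th and 23rd of each month
shifted = grid[2:2 + 24]                                                  # 08-Feb-2011 ... 23-Jan-2012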
Re: [Numpy-discussion] Changing the datetime operation unit rules
Mark Wiebe mwwiebe at gmail.com writes: a = np.datetime64('today') a - a.astype('M8[Y]') numpy.timedelta64(157,'D') vs a = np.datetime64('today') a - a.astype('M8[Y]') Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: ufunc subtract cannot use operands with types dtype('datetime64[D]') and dtype('datetime64[Y]') Cheers, Mark I think I like the latter behaviour actually. The first one implicitly assumes I want the yearly frequency converted to the start of the period. I think the user should have to explicitly convert the dates to a common representation before performing arithmetic on them. It seems to me that if you didn't want your data to be silently cast to the start of the period it would be very difficult to determine whether this had happened or not. As an example, consider the pricing of a stock using the Black-Scholes formula. The value is dependent on the time to maturity, which is the expiry date minus the current date. If the expiry date was accidentally specified at (or converted to) a monthly frequency and the current date was at a daily frequency, taking the difference will give you the wrong number of days until expiry, resulting in a subtly different answer - not good. NB: The timeseries Date/DateArray have a day_of_year attribute which is very useful. Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
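To make the maturity hazard concrete, here is a sketch of the kind of error silent casting allows, assuming the permissive behaviour of the first example above (mixed units cast to the finer one; the dates themselves are just made-up placeholders):

import numpy as np

today  = np.datetime64('2011-06-07')   # daily unit
expiry = np.datetime64('2011-09-17')   # daily unit
expiry - today                         # numpy.timedelta64(102,'D') - the correct time to maturity
np.datetime64('2011-09') - today       # a monthly expiry is silently cast to 2011-09-01,
                                       # giving numpy.timedelta64(86,'D') - 16 days short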
Re: [Numpy-discussion] numpy speed question
Jean-Luc Menut jeanluc.menut at free.fr writes: I have a little question about the speed of numpy vs IDL 7.0. Here is the IDL result: % Compiled module: $MAIN$. 2.837 The python code:

from numpy import *
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)
time2 = time()
print time2-time1

result: In [2]: run python_test_speed.py 24.180943 Whilst you've imported everything from numpy you're not really using numpy - you're still using a slow Python (double) loop. The power of numpy comes from vectorising your code - i.e. applying functions to arrays of data. The example below demonstrates an 80-fold increase in speed by vectorising the calculation:

def method1():
    a = empty([1000, 10000])
    for j in range(10000):
        for i in range(1000):
            a[i,j] = cos(2*pi*i/100.)
    return a

def method2():
    ij = np.repeat((2*pi*np.arange(1000)/100.)[:,None], 10000, axis=1)
    return np.cos(ij)

In [46]: timeit method1() 1 loops, best of 3: 47.9 s per loop In [47]: timeit method2() 1 loops, best of 3: 589 ms per loop In [48]: allclose(method1(), method2()) Out[48]: True ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
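A further point worth noting: np.repeat materialises the full array before the cosine is taken, yet the value only depends on i, so computing the cosine on the 1,000 distinct values first and replicating afterwards is cheaper still (a sketch, assuming the array sizes shown above):

import numpy as np

def method3():
    # cosine of the 1,000 distinct values, then replicate across the columns
    col = np.cos(2*np.pi*np.arange(1000)/100.)
    return np.repeat(col[:, None], 10000, axis=1)

allclose(method2(), method3()) should return True.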
Re: [Numpy-discussion] How to import input data to make ndarray for batch processing?
Venkat dvr002 at gmail.com writes: Hi All, I am new to Numpy (also Scipy). I am trying to reshape my text data which is in one single column (10,000 rows). I want the data to be in 100x100 array form. I have many files to convert like this. All of them have file names like 0, 1, 2, ... 500 without any extension. Actually, I renamed the actual files so that I can import them in Matlab for batch processing. Since Matlab is also new for me, I thought I would try Numpy first. Can anybody help me in writing the script to do this for batch processing? Thanks in advance, Venkat In [2]: dummy_data = np.random.randn(100,100) In [3]: dummy_data.shape Out[3]: (100, 100) In [4]: dummy_data.flatten().shape Out[4]: (10000,) In [5]: np.savetxt('dummy_data.txt', dummy_data.flatten()) In [6]: !less dummy_data.txt 2.571031186906808100e-01 1.566790681796508500e+00 -6.846267829937818800e-01 3.271332705287631200e-01 -7.482409829656505600e-02 1.429095432454441600e-01 -6.41694801869400e-01 -5.319842186383831900e-01 -4.047786844569442600e-01 -6.696045994533519300e-01 -4.895085917712171400e-01 6.969419814656118200e-01 6.656815445278644300e-01 7.455135053686292600e-01 ... In [7]: data = np.loadtxt('dummy_data.txt') In [8]: data.shape Out[8]: (10000,) In [9]: data = data.reshape(100, 100) In [10]: data.shape Out[10]: (100, 100) In [11]: np.allclose(dummy_data, data) Out[11]: True HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
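A minimal batch-processing loop along those lines might look like this (a sketch only - the extensionless file names 0 to 500 are taken from the question and the exact range is an assumption):

import numpy as np

arrays = {}
for name in range(501):
    data = np.loadtxt(str(name))        # one column of 10,000 values
    arrays[name] = data.reshape(100, 100)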
Re: [Numpy-discussion] Printing formatted numerical values
pv+numpy at math.duke.edu writes: Hi, what is the best way to print (to a file or to stdout) formatted numerical values? Analogously to C's printf("%d %g", x, y) etc? For stdout you can simply do: In [26]: w, x, y, z = np.random.randint(0,100,4) In [27]: type(w) Out[27]: <type 'numpy.int32'> In [28]: print("%f %g %e %d" % (w, x, y, z)) 68.000000 64 1.300000e+01 57 In [29]: w, x, y, z Out[29]: (68, 64, 13, 57) For a file I would use np.savetxt HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
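np.savetxt accepts the same printf-style codes via its fmt argument, e.g. (a sketch - the file name and values are just placeholders):

import numpy as np

w, x, y, z = np.random.randint(0, 100, 4)
np.savetxt('values.txt', np.column_stack([w, x, y, z]), fmt='%f %g %e %d')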
Re: [Numpy-discussion] Unpleasant behavior with poly1d and numpy scalar multiplication
Charles R Harris charlesr.harris at gmail.com writes: I was also thinking that someone might want to provide a better display at some point, drawing on a canvas, for instance. And what happens when the degree gets up over 100, which is quite reasonable with the Chebyshev polynomials? There may well be better ways to do it but I've found the following function to be quite handy for visualising latex equations:

from matplotlib import interactive, is_interactive
from matplotlib.backends.backend_agg import RendererAgg
from matplotlib.pyplot import figure, figtext, show
from sympy import latex as sympy_latex

def eqview(expr, fontsize=28, dpi=80):
    IS_INTERACTIVE = is_interactive()
    try:
        interactive(False)
        # convert the sympy expression to a mathtext string
        latex = '$%s$' % sympy_latex(expr)
        fig = figure(dpi=dpi, facecolor='w')
        h = figtext(0.5, 0.5, latex,
                    fontsize=fontsize,
                    horizontalalignment='center',
                    verticalalignment='center')
        # size the figure to fit the rendered equation
        bbox = h.get_window_extent(RendererAgg(15, 15, dpi))
        fig.set_size_inches(1.1*bbox.width/dpi, 1.25*bbox.height/dpi)
        show()
    finally:
        interactive(IS_INTERACTIVE)

NB: Sympy provides the latex function to convert the equation objects into latex as well as other ways to display the objects in the sympy.printing module. It shouldn't be too hard to do something similar if someone was so inclined! HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion