Re: [Numpy-discussion] Fastest way to compute summary statistics for a specific axis
Sebastian Berg sebastian at sipsolutions.net writes:

On Mo, 2015-03-16 at 15:53 +0000, Dave Hirschfeld wrote: I have a number of large arrays for which I want to compute the mean and standard deviation over a particular axis - e.g. I want to compute the statistics for axis=1 as if the other axes were combined so that in the example below I get two values back

In [1]: a = randn(30, 2, 1)

For the mean this can be done easily like:

In [2]: a.mean(0).mean(-1)
Out[2]: array([ 0.0007, -0.0009])

If you have numpy 1.7+ (which I guess by now is basically always the case), you can do a.mean((0, 1)). Though it isn't actually faster in this example, probably because it has to use buffered iterators and things, but I would guess the performance should be much more stable depending on memory order, etc. than any other method.

- Sebastian

Wow, I didn't know you could even do that - that's very cool (and a lot cleaner than manually reordering and reshaping). It seems to be pretty fast for me and reasonably stable wrt memory order:

In [199]: %timeit a.mean(0).mean(-1)
     ...: %timeit a.mean(axis=(0,2))
     ...: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1)
     ...: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0)
     ...:
1000 loops, best of 3: 1.52 ms per loop
1000 loops, best of 3: 1.5 ms per loop
100 loops, best of 3: 4.8 ms per loop
100 loops, best of 3: 14.6 ms per loop

In [200]: a = a.copy('F')

In [201]: %timeit a.mean(0).mean(-1)
     ...: %timeit a.mean(axis=(0,2))
     ...: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1)
     ...: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0)
100 loops, best of 3: 2.02 ms per loop
100 loops, best of 3: 3.29 ms per loop
100 loops, best of 3: 7.18 ms per loop
100 loops, best of 3: 15.9 ms per loop

Thanks,
Dave
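A minimal sketch of the tuple-axis approach applied to the original std question (assuming numpy >= 1.7; the array shape is just for illustration):

import numpy as np

a = np.random.randn(30, 2, 10000)   # hypothetical shape for illustration

# statistics for axis=1 with the other axes combined, no transpose/reshape needed
m = a.mean(axis=(0, 2))
s = a.std(axis=(0, 2))
assert m.shape == (2,) and s.shape == (2,)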
[Numpy-discussion] Fastest way to compute summary statistics for a specific axis
I have a number of large arrays for which I want to compute the mean and standard deviation over a particular axis - e.g. I want to compute the statistics for axis=1 as if the other axes were combined so that in the example below I get two values back In [1]: a = randn(30, 2, 1) For the mean this can be done easily like: In [2]: a.mean(0).mean(-1) Out[2]: array([ 0.0007, -0.0009]) ...but this won't work for the std. Using some transformations we can come up with something which will work for either: In [3]: a.transpose(2,0,1).reshape(-1, 2).mean(axis=0) Out[3]: array([ 0.0007, -0.0009]) In [4]: a.transpose(1,0,2).reshape(2, -1).mean(axis=-1) Out[4]: array([ 0.0007, -0.0009]) If we look at the performance of these equivalent methods: In [5]: %timeit a.transpose(2,0,1).reshape(-1, 2).mean(axis=0) 100 loops, best of 3: 14.5 ms per loop In [6]: %timeit a.transpose(1,0,2).reshape(2, -1).mean(axis=-1) 100 loops, best of 3: 5.05 ms per loop we can see that the latter version is a clear winner. Investigating further, both methods appear to copy the data so the performance is likely down to better cache utilisation. In [7]: np.may_share_memory(a, a.transpose(2,0,1).reshape(-1, 2)) Out[7]: False In [8]: np.may_share_memory(a, a.transpose(1,0,2).reshape(2, -1)) Out[8]: False Both methods are however significantly slower than the initial attempt: In [9]: %timeit a.mean(0).mean(-1) 1000 loops, best of 3: 1.2 ms per loop Perhaps because it allocates a smaller temporary? For those who like a challenge: is there a faster way to achieve what I'm after? Cheers, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Optimizing numpy's einsum expression (again)
Daniel Smith dgasmith at icloud.com writes:

Hello everyone, I originally brought an optimized einsum routine forward a few weeks back that attempts to contract numpy arrays together in an optimal way. This can greatly reduce the scaling and overall cost of the einsum expression for the cost of a few intermediate arrays. The current version (and more details) can be found here: https://github.com/dgasmith/opt_einsum I think this routine is close to a finalized version, but there are two problems which I would like the community to weigh in on: Thank you for your time, -Daniel Smith

I wasn't aware of this work, but it's very interesting to me as a user of einsum whose principal reason for doing so is speed. Even though I use it on largish arrays I'm only concerned with the performance as I'm on x64 with plenty of memory even were it to require temporaries 5x the original size.

I don't use einsum that much because I've noticed the performance can be very problem dependent so I've always profiled it to check. Hopefully this work will make the performance more consistent, allowing it to be used more generally throughout my code.

Thanks,
Dave

* An anecdotal example from a user only.
Re: [Numpy-discussion] help using np.einsum for stacked matrix multiplication
Andrew Nelson writes:

Dear list, I have a 4D array, A, that has the shape (NX, NY, 2, 2). I wish to perform matrix multiplication of the 'NY' 2x2 matrices, resulting in the matrix B. B would have shape (NX, 2, 2). I believe that np.einsum would be up to the task, but I'm not quite sure of the subscripts I would need to achieve this. Can anyone help, please? cheers, Andrew.

Sorry, I'm a little unclear on what is supposed to be multiplied with what. You've got (NX x NY) 2x2 matrices, how do you end up with NX 2x2 matrices? Perhaps if you show the code using loops with `np.dot` it will be clearer how to translate to using `np.einsum`

-Dave
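For reference, two plausible readings of the question, sketched only (the sizes and the second operand are made up for illustration):

import numpy as np

NX, NY = 4, 5                           # hypothetical sizes
A = np.random.randn(NX, NY, 2, 2)
B = np.random.randn(NX, NY, 2, 2)

# Reading 1: multiply corresponding 2x2 matrices from two stacks elementwise
C = np.einsum('xyij,xyjk->xyik', A, B)  # shape (NX, NY, 2, 2)

# Reading 2: reduce over NY by multiplying the 2x2 matrices in sequence
# (a chained product like this can't be written as a single einsum call)
D = np.empty((NX, 2, 2))
for x in range(NX):
    acc = np.eye(2)
    for y in range(NY):
        acc = np.dot(acc, A[x, y])
    D[x] = acc                          # shape (NX, 2, 2)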
Re: [Numpy-discussion] segfault in np.arange
Julian Taylor jtaylor.debian at googlemail.com writes: On 23.10.2014 19:21, Dave Hirschfeld wrote: Hi, I accidentally passed a pandas DatetimeIndex to `np.arange` which caused it to segfault. It's a pretty dumb thing to do but I don't think it should cause a segfault! thanks for the report, this patch should fix it: https://github.com/numpy/numpy/pull/5225 Thanks for the super-fast patch! Hopefully it will make it into the next bugfix release, even if it is an apparently little used functionality. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] segfault in np.arange
Hi,
I accidentally passed a pandas DatetimeIndex to `np.arange` which caused it to segfault. It's a pretty dumb thing to do but I don't think it should cause a segfault!

Python 2.7.5 |Continuum Analytics, Inc.| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import faulthandler
>>> faulthandler.enable()
>>> import numpy as np
>>> np.__version__
'1.9.0'
>>> import pandas as pd
>>> pd.__version__
'0.14.1'
>>> np.arange(pd.DatetimeIndex([]))
Fatal Python error: Segmentation fault

Current thread 0x1d18:
  File "<stdin>", line 1 in <module>

The exception dialog which pops up contains the following information:

Unhandled exception at 0x0255EB49 (multiarray.pyd) in python.exe: 0xC0000005: Access violation reading location 0x00000008.

Thanks,
Dave
[Numpy-discussion] Website down!
It seems that the docs website is down?

http://docs.scipy.org/doc/

-Dave
Re: [Numpy-discussion] simple way to denote unchanged dimension in reshape?
Chao YUE chaoyuejoy at gmail.com writes:

Dear all, I have a simple question. Is there a way to denote the unchanged dimension in the reshape function? Like suppose I have an array named arr having three dims with the first dimension length as 48, I want to reshape the first dim into 12*4, but keeping all the other dimension lengths unchanged. Like when we slice the array, we can use arr[10:40, ...], where '...' represents all remaining dimensions. However when doing reshape, we must use: arr.reshape(12, -1, arr.shape[1], arr.shape[2]) Is there something allowing more flexible reshape, like arr.reshape(12, -1, ...)? thanks a lot in advance, best, Chao

For the example given the below code works:

In [1]: x = randn(48,5,4,3,2)

In [2]: x.reshape(12,-1,*x.shape[1:]).shape
Out[2]: (12L, 4L, 5L, 4L, 3L, 2L)

HTH,
Dave
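A small helper along these lines (a sketch only, not an existing numpy function) makes the intent a bit more explicit:

import numpy as np

def reshape_first(arr, *first_dims):
    # reshape only the leading axis, leaving all remaining axes unchanged
    return arr.reshape(first_dims + arr.shape[1:])

x = np.random.randn(48, 5, 4, 3, 2)
assert reshape_first(x, 12, 4).shape == (12, 4, 5, 4, 3, 2)
assert reshape_first(x, 12, -1).shape == (12, 4, 5, 4, 3, 2)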
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Julian Taylor jtaylor.debian at googlemail.com writes: On 16.05.2014 10:59, Dave Hirschfeld wrote: Julian Taylor jtaylor.debian at googlemail.com writes: Yes, I'd heard about the improvements and am very excited to try them out since indexing is one of the bottlenecks in our algorithm. I made a PR with the simple change: https://github.com/numpy/numpy/pull/4721 improves it by the expected 50%, but its still 40% slower than the improved normal indexing. Having some problems building numpy to test this out, but assuming it does what it says on the tin I'd be very keen to get this in the impending 1.9 release if possible. Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Sebastian Berg sebastian at sipsolutions.net writes:

On Do, 2014-05-15 at 12:31 +0000, Dave Hirschfeld wrote: As can be seen from the code below (or in the notebook linked beneath) fancy indexing of a structured array is twice as slow as indexing both fields independently - making it 4x slower?

The non-vanilla types tend to be somewhat more efficient with these things and the first indexing does not copy so it is rather fast. I did not check the code, but we use (also in the new one for this operation) the copyswap function on individual elements (only for non-trivial copies in 1.9 and later, making the difference even larger), and this is probably not specialized to the specific void type so it probably has to call the copyswap for every field (and first get the fields). All that work would be done for every element. If you are interested in this, you could check the fancy indexing inner loop and see if replacing the copyswap with the specialized strided transfer functions (it is used further down in a different branch of the loop) actually makes things faster. I would expect so for some void types anyway, but not sure in general.

- Sebastian

Thanks for the explanation and pointers - it sounds like a good opportunity for getting stuck into the internals of numpy which I've been meaning to do. I'm not sure I've got the required skills but I'm sure it will be a good learning experience. Unfortunately it won't likely be in the immediate future that I'll have the time to do so. In the meantime I can live with indexing the fields independently.

Thanks,
Dave
Re: [Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
Julian Taylor jtaylor.debian at googlemail.com writes:

if ~50% faster is fast enough a simple improvement would be to replace the use of PyArg_ParseTuple with manual tuple unpacking. The PyArg functions are incredibly slow and is not required in VOID_copyswap which just extracts 'Oi'. This 50% increase still makes it slower than the simpler indexing variant as these have been greatly improved in 1.9 (thanks to Sebastian for this :) )

Yes, I'd heard about the improvements and am very excited to try them out since indexing is one of the bottlenecks in our algorithm.

-Dave
[Numpy-discussion] Fancy Indexing of Structured Arrays is Slow
As can be seen from the code below (or in the notebook linked beneath) fancy indexing of a structured array is twice as slow as indexing both fields independently - making it 4x slower?

I found that fancy indexing was a bottleneck in my application so I was hoping to reduce the overhead by combining the arrays into a structured array and only doing one indexing operation. Unfortunately that doubled the time that it took! Is there any reason for this? If not, I'm happy to open an enhancement issue on GitHub - just let me know.

Thanks,
Dave

In [32]: nrows, ncols = 365, 1

In [33]: items = np.rec.fromarrays(randn(2, nrows, ncols), names=['widgets','gadgets'])

In [34]: row_idx = randint(0, nrows, ncols)
    ...: col_idx = np.arange(ncols)

In [35]: %timeit filtered_items = items[row_idx, col_idx]
100 loops, best of 3: 3.45 ms per loop

In [36]: %%timeit
    ...: widgets = items['widgets'][row_idx, col_idx]
    ...: gadgets = items['gadgets'][row_idx, col_idx]
    ...:
1000 loops, best of 3: 1.57 ms per loop

http://nbviewer.ipython.org/urls/gist.githubusercontent.com/dhirschfeld/98b9970fb68adf23dfea/raw/10c0f968ea1489f0a24da80d3af30de7106848ac/Slow%20Structured%20Array%20Indexing.ipynb

https://gist.github.com/dhirschfeld/98b9970fb68adf23dfea
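For reference, the field-by-field workaround can be wrapped up so calling code still gets a single structured result back - a sketch only, assuming broadcastable 1-D index arrays (the sizes below are hypothetical):

import numpy as np

def take_fields(arr, row_idx, col_idx):
    # fancy-index each field independently (the fast path) and reassemble
    # the results into a structured array with the same dtype
    out = np.empty(np.broadcast(row_idx, col_idx).shape, dtype=arr.dtype)
    for name in arr.dtype.names:
        out[name] = arr[name][row_idx, col_idx]
    return out

nrows, ncols = 365, 1000   # hypothetical sizes
items = np.rec.fromarrays(np.random.randn(2, nrows, ncols), names=['widgets', 'gadgets'])
row_idx = np.random.randint(0, nrows, ncols)
col_idx = np.arange(ncols)
filtered = take_fields(items, row_idx, col_idx)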
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Sankarshan Mudkavi smudkavi at uwaterloo.ca writes:

Hey all, It's been a while since the last datetime and timezones discussion thread was visited (linked below): http://thread.gmane.org/gmane.comp.python.numeric.general/53805 It looks like the best approach to follow is the UTC only approach in the linked thread with an optional flag to indicate the timezone (to avoid confusing applications where they don't expect any timezone info). Since this is slightly more useful than having just a naive datetime64 package and would be open to extension if required, it's probably the best way to start improving the datetime64 library. <snip> I would like to start writing a NEP for this followed by implementation, however I'm not sure what the format etc. is, could someone direct me to a page where this information is provided? Please let me know if there are any ideas, comments etc. Cheers, Sankarshan

See: http://article.gmane.org/gmane.comp.python.numeric.general/55191

You could use a current NEP as a template: https://github.com/numpy/numpy/tree/master/doc/neps

I'm a huge +100 on the simplest UTC fix. As is, using numpy datetimes is likely to silently give incorrect results - something I've already seen several times in end-user data analysis code.

Concrete Example:

In [16]: dates = pd.date_range('01-Apr-2014', '04-Apr-2014', freq='H')[:-1]
    ...: values = np.array([1,2,3]).repeat(24)
    ...: records = zip(map(str, dates), values)
    ...: pd.TimeSeries(values, dates).groupby(lambda d: d.date()).mean()
    ...:
Out[16]:
2014-04-01    1
2014-04-02    2
2014-04-03    3
dtype: int32

In [17]: df = pd.DataFrame(np.array(records, dtype=[('dates', 'M8[h]'), ('values', float)]))
    ...: df.set_index('dates', inplace=True)
    ...: df.groupby(lambda d: d.date()).mean()
    ...:
Out[17]:
              values
2014-03-31  1.000000
2014-04-01  1.041667
2014-04-02  2.041667
2014-04-03  3.000000

[4 rows x 1 columns]

Try it in your timezone and see what you get!

-Dave
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
Jeff Reback jeffreback at gmail.com writes:

Dave, your example is not a problem with numpy per se, rather that the default generation is in local timezone (same as what python datetime does). If you localize to UTC you get the results that you expect.

The problem is that the default datetime generation in *numpy* is in local time. Note that this *is not* the case in Python - it doesn't try to guess the timezone info based on where in the world you run the code, if it's not provided it sets it to None.

In [7]: pd.datetime?
Type:        type
String Form: <type 'datetime.datetime'>
Docstring:
datetime(year, month, day[, hour[, minute[, second[, microsecond[, tzinfo]]]]])

The year, month and day arguments are required. tzinfo may be None, or an instance of a tzinfo subclass. The remaining arguments may be ints or longs.

In [8]: pd.datetime(2000,1,1).tzinfo is None
Out[8]: True

This may be the best solution but as others have pointed out this is more difficult to implement and may have other issues. I don't want to wait for the best solution - the assume UTC on input/output if not specified will solve the problem and this desperately needs to be fixed because it's completely broken as is IMHO.

"If you localize to UTC you get the results that you expect."

That's the whole point - *numpy* needs to localize to UTC, not to whatever timezone you happen to be in when running the code. In a real-world data analysis problem you don't start with the data in a DataFrame or a numpy array - it comes from the web, a csv, Excel, a database and you want to convert it to a DataFrame or numpy array. So what you have from whatever source is a list of tuples of strings and you want to convert them into a typed array. Obviously you can't localize a string - you have to convert it to a date first and if you do that with numpy the date you have is wrong.

In [108]: dst = np.array(['2014-03-30 00:00', '2014-03-30 01:00', '2014-03-30 02:00'], dtype='M8[h]')
     ...: dst
     ...:
Out[108]: array(['2014-03-30T00+0000', '2014-03-30T00+0000', '2014-03-30T02+0100'], dtype='datetime64[h]')

In [109]: dst.tolist()
Out[109]:
[datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 0, 0),
 datetime.datetime(2014, 3, 30, 1, 0)]

AFAICS there's no way to get the original dates back once they've passed through numpy's parser!?

-Dave
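One way to avoid the local-time interpretation at parse time, on numpy versions from this era that still accept explicit timezone designators, is to make the offset explicit in the strings before conversion - a sketch of the idea, not a recommendation:

import numpy as np

strings = ['2014-03-30 00:00', '2014-03-30 01:00', '2014-03-30 02:00']

# append an explicit UTC designator so the parser can't apply the local offset
utc = np.array([s.replace(' ', 'T') + 'Z' for s in strings], dtype='M8[h]')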
Re: [Numpy-discussion] svd error checking vs. speed
alex argriffi at ncsu.edu writes:

Hello list, Here's another idea resurrection from numpy github comments that I've been advised could be posted here for re-discussion. The proposal would be to make np.linalg.svd more like scipy.linalg.svd with respect to input checking. The argument against the change is raw speed; if you know that you will never feed non-finite input to svd, then np.linalg.svd is a bit faster than scipy.linalg.svd. An argument for the change could be to avoid issues reported on github like crashes, hangs, spurious non-convergence exceptions, etc. from the undefined behavior of svd of non-finite input. [...] the following numpy code hangs until I `kill -9` it.

```
$ python runtests.py --shell
$ python
Python 2.7.5+ [GCC 4.8.1] on linux2
>>> import numpy as np
>>> np.__version__
'1.9.0.dev-e3f0f53'
>>> A = np.array([[1e3, 0], [0, 1]])
>>> B = np.array([[1e300, 0], [0, 1]])
>>> C = np.array([[1e3000, 0], [0, 1]])
>>> np.linalg.svd(A)
(array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]]))
>>> np.linalg.svd(B)
(array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]]))
>>> np.linalg.svd(C)
[hangs forever]
```

Alex

I'm -1 on checking finiteness - if there's one place you usually want maximum performance it's linear algebra operations. It certainly shouldn't crash or hang though and for me at least it doesn't - it returns NaN which immediately suggests to me that I've got bad input (maybe just because I've seen it before). I'm not sure adding an extra kwarg is worth cluttering up the api when a simple call to isfinite beforehand will do the job if you think you may potentially have non-finite input.

Python 2.7.5 |Anaconda 1.8.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

In [1]: import numpy as np

In [2]: A = np.array([[1e3, 0], [0, 1]])
   ...: B = np.array([[1e300, 0], [0, 1]])
   ...: C = np.array([[1e3000, 0], [0, 1]])
   ...: np.linalg.svd(A)
   ...:
Out[2]: (array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]]))

In [3]: np.linalg.svd(B)
Out[3]: (array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]]))

In [4]: C
Out[4]: array([[ inf, 0.], [ 0., 1.]])

In [5]: np.linalg.svd(C)
Out[5]: (array([[ 0., 1.], [ 1., 0.]]), array([ nan, nan]), array([[ 0., 1.], [ 1., 0.]]))

In [6]: np.__version__
Out[6]: '1.7.1'

Regards,
Dave
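The isfinite-up-front approach is a one-liner in user code; a minimal sketch of what such a wrapper might look like (not an existing numpy function):

import numpy as np

def checked_svd(a, **kwargs):
    # pay for the finiteness check only when you actually want it,
    # keeping np.linalg.svd itself free of the overhead
    if not np.isfinite(a).all():
        raise ValueError("array must not contain infs or NaNs")
    return np.linalg.svd(a, **kwargs)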
Re: [Numpy-discussion] svd error checking vs. speed
Sturla Molden sturla.molden at gmail.com writes:

josef.pktd at gmail.com wrote: I use official numpy release for development, Windows, 32bit python, i.e. MingW 3.5 and whatever old ATLAS the release includes. A constant 13% cpu usage is 1/8th of my 8 virtual cores.

Based on this and Alex' message it seems the offender is the f2c generated lapack_lite library. So what do we do with lapack_lite? Should we patch it? Sturla

Even if lapack_lite always performed the isfinite check and threw a python error if False, it would be much better than either hanging or segfaulting and people who care about the isfinite cost probably would be linking to a fast lapack anyway.

-Dave
Re: [Numpy-discussion] Numpy 1.9 release date
Ralf Gommers ralf.gommers at gmail.com writes: On Fri, Nov 8, 2013 at 8:22 PM, Charles R Harris charlesr.harris at gmail.com wrote: and think that the main thing missing at this point is fixing the datetime problems. Is anyone planning to work on this? If yes, you need a rough estimate of when this is ready to go. If no, it needs to be decided if this is critical for the release. From the previous discussion I tend to think so. If it's critical but no one does it, why plan a release... Ralf Just want to pipe up here as to the criticality of datetime bug. Below is a minimal example from some data analysis code I found in our company that was giving incorrect results (fortunately it was caught by thorough testing): In [110]: records = [ ...: ('2014-03-29 23:00:00', '2014-03-29 23:00:00'), ...: ('2014-03-30 00:00:00', '2014-03-30 00:00:00'), ...: ('2014-03-30 01:00:00', '2014-03-30 01:00:00'), ...: ('2014-03-30 02:00:00', '2014-03-30 02:00:00'), ...: ('2014-03-30 03:00:00', '2014-03-30 03:00:00'), ...: ('2014-10-25 23:00:00', '2014-10-25 23:00:00'), ...: ('2014-10-26 00:00:00', '2014-10-26 00:00:00'), ...: ('2014-10-26 01:00:00', '2014-10-26 01:00:00'), ...: ('2014-10-26 02:00:00', '2014-10-26 02:00:00'), ...: ('2014-10-26 03:00:00', '2014-10-26 03:00:00')] ...: ...: ...: data = np.asarray(records, dtype=[('date obj', 'M8[h]'), ('str repr', object)]) ...: df = pd.DataFrame(data) In [111]: df Out[111]: date obj str repr 0 2014-03-29 23:00:00 2014-03-29 23:00:00 1 2014-03-30 00:00:00 2014-03-30 00:00:00 2 2014-03-30 00:00:00 2014-03-30 01:00:00 3 2014-03-30 01:00:00 2014-03-30 02:00:00 4 2014-03-30 02:00:00 2014-03-30 03:00:00 5 2014-10-25 22:00:00 2014-10-25 23:00:00 6 2014-10-25 23:00:00 2014-10-26 00:00:00 7 2014-10-26 01:00:00 2014-10-26 01:00:00 8 2014-10-26 02:00:00 2014-10-26 02:00:00 9 2014-10-26 03:00:00 2014-10-26 03:00:00 Note the local timezone adjusted `date obj` including the duplicate value at the clock-change in March and the missing value at the clock-change in October. As you can imagine this could very easily lead to incorrect analysis. If running this exact same code in the (Eastern) US you'd see the following results: date obj str repr 0 2014-03-30 03:00:00 2014-03-29 23:00:00 1 2014-03-30 04:00:00 2014-03-30 00:00:00 2 2014-03-30 05:00:00 2014-03-30 01:00:00 3 2014-03-30 06:00:00 2014-03-30 02:00:00 4 2014-03-30 07:00:00 2014-03-30 03:00:00 5 2014-10-26 03:00:00 2014-10-25 23:00:00 6 2014-10-26 04:00:00 2014-10-26 00:00:00 7 2014-10-26 05:00:00 2014-10-26 01:00:00 8 2014-10-26 06:00:00 2014-10-26 02:00:00 9 2014-10-26 07:00:00 2014-10-26 03:00:00 Unfortunately I don't have the skills to meaningfully contribute in this area but it is a very real problem for users of numpy, many of whom are not active on the mailing list. HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
josef.pktd at gmail.com writes: I think a H is feature creep and too specialized What's .H of a int a str a bool ? It's just .T and a view, so you cannot rely that conj() makes a copy if you don't work with complex. .T is just a reshape function and has **nothing** to do with matrix algebra. It seems to me that that ship has already sailed - i.e. conj doesn't make much sense for str arrays, but it still works in the sense that it's a nop In [16]: A = asarray(list('abcdefghi')).reshape(3,3) ...: np.all(A.T == A.conj().T) ...: Out[16]: True If we're voting my vote goes to add the .H attribute for all the reasons Alan has specified. Document that it returns a copy but that it may in future return a view so it it not future proof to operate on the result inplace. I'm -1 on .H() as it will require code changes if it ever changes to a property and it will simply result in questions about why .T is a property and .H is a function (and why it's a property for (sparse) matrices) Regarding Dag's example: xh = x.H x *= 2 assert np.all(2 * xh == x.H) I'm sceptical that there's much code out there actually relying on the fact that a transpose is a view with the specified intention of altering the original array inplace. I work with a lot of beginners and whenever I've seen them operate inplace on a transpose it has been a bug in the code, leading to a discussion of how, for performance reasons, numpy will return a view where possible, leading to yet further discussion of when it is and isn't possible to return a view. The third option of .H returning a view would probably be agreeable to everyone but I don't think we should punt on this decision for something that if it does happen is likely years away. It seems that work on this front is happening in different projects to numpy. Even if for example sometime in the future numpy's internals were replaced with libdynd or other expression graph engine surely this would result in more breaking changes than .H returning a view rather than a copy?! IANAD so I'm happy with whatever the consensus is I just thought I'd put forward the view from a (specific type of) user perspective. Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
Nathaniel Smith njs at pobox.com writes: As soon as you talk about attributes returning things you've already broken Python's mental model... attributes are things that sit there, not things that execute arbitrary code. Of course this is not how the actual implementation works, attribute access *can* in fact execute arbitrary code, but the mental model is important, so we should preserve it where-ever we can. Just mentioning an attribute should not cause unbounded memory allocations. Yep, sorry - sloppy use of terminology which I agree is important in helping understand what's happening. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] add .H attribute?
Alan G Isaac alan.isaac at gmail.com writes: On 7/22/2013 3:10 PM, Nathaniel Smith wrote: Having .T but not .H is an example of this split. Hate to do this but ... Readability counts. +10! A.conjugate().transpose() is unspeakably horrible IMHO. Since there's no way to avoid a copy you gain nothing by not providing the convenience function. It should be fairly obvious that an operation which changes the values of an array (and doesn't work in-place) necessarily takes a copy. I think it's more than sufficient to simply document the fact that A.H will return a copy. A user coming from Matlab probably doesn't care that it takes a copy but you'd be hard pressed to convince them there's any benefit of writing A.conjugate().transpose() over exactly what it looks like in textbooks - A.H Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
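In the meantime the textbook spelling is easy enough to get in user code; a throwaway helper along these lines (a sketch, not part of numpy) keeps call sites readable while the .H discussion is unresolved:

import numpy as np

def H(a):
    # conjugate transpose; note this necessarily returns a copy for complex input
    return np.conjugate(a).T

A = np.array([[1 + 2j, 3], [4j, 5]])
assert np.allclose(H(H(A)), A)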
[Numpy-discussion] datetime64 constructor ignores dtype argument?
The example below demonstrates the fact that the datetime64 constructor ignores the dtype argument if passed in. Is this conscious design decision or a bug/oversight? In [55]: from datetime import datetime ...: d = datetime.now() ...: In [56]: d Out[56]: datetime.datetime(2013, 6, 12, 11, 3, 27, 861000) In [57]: np.datetime64(d) Out[57]: numpy.datetime64('2013-06-12T12:03:27.861000+0100') In [58]: np.datetime64(d).dtype Out[58]: dtype('M8[us]') In [59]: np.datetime64(d, dtype='M8[D]') Out[59]: numpy.datetime64('2013-06-12T12:03:27.861000+0100') In [60]: np.datetime64(d, dtype='M8[D]').dtype Out[60]: dtype('M8[us]')??? In [61]: np.datetime64(d).astype('M8[D]') Out[61]: numpy.datetime64('2013-06-12') Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
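A possible workaround: the unit can be given as the second positional argument to the constructor, which does appear to be honoured (unlike the dtype= keyword above) - worth verifying on the numpy version in question:

import numpy as np
from datetime import datetime

d = datetime(2013, 6, 12, 11, 3, 27, 861000)

day = np.datetime64(d, 'D')   # expected to truncate to day precision
assert day.dtype == np.dtype('M8[D]')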
Re: [Numpy-discussion] 1.8 release
Charles R Harris charlesr.harris at gmail.com writes: Hi All,I think it is time to start the runup to the 1.8 release. I don't know of any outstanding blockers but if anyone has a PR/issue that they feel needs to be in the next Numpy release now is the time to make it known.Chuck It would be good to get the utc-everywhere fix for datetime64 in there if someone has time to look into it. Thanks, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] timezones and datetime64
Andreas Hilboll lists at hilboll.de writes:

I think your point about using current timezone in interpreting user input being dangerous is probably correct --- perhaps UTC all the way would be a safer (and simpler) choice?

+1

+10 from me! I've recently come across a bug due to the fact that numpy interprets dates as being in the local timezone. The data comes from a database query where there is no timezone information supplied (and dates are stored as strings). It is assumed that the user doesn't need to know the timezone - i.e. the dates are timezone naive. Working out the correct timezones would be fairly laborious, but whatever the correct timezones are, they're certainly not the timezone the current user happens to find themselves in! e.g.

In [32]: rs = [
    ...:     (u'2000-01-17 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1203),
    ...:     (u'2000-01-26 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1369),
    ...:     (u'2000-01-18 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1122),
    ...:     (u'2000-02-25 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1425)
    ...: ]
    ...: dtype = [('issue_date', 'datetime64[ns]'),
    ...:          ('start_date', 'datetime64[D]'),
    ...:          ('end_date', 'datetime64[D]'),
    ...:          ('value', float)]
    ...: #

In [33]: # What I see in London, UK
    ...: recordset = np.array(rs, dtype=dtype)
    ...: df = pd.DataFrame(recordset)
    ...: df = df.set_index('issue_date')
    ...: df
    ...:
Out[33]:
                     start_date             end_date   value
issue_date
2000-01-17  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-26  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-18  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-25  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

In [34]: # What my colleague sees in Auckland, NZ
    ...: recordset = np.array(rs, dtype=dtype)
    ...: df = pd.DataFrame(recordset)
    ...: df = df.set_index('issue_date')
    ...: df
    ...:
Out[34]:
                              start_date             end_date   value
issue_date
2000-01-16 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-25 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-17 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-24 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

Oh dear! This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects.

Regards,
Dave
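For the record, a sketch of the object-dtype bypass described in the last paragraph (assuming dateutil is available; the row layout follows the example above):

import numpy as np
from dateutil.parser import parse

rs = [(u'2000-01-17 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1203),
      (u'2000-01-26 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1369)]

# keep the issue dates as timezone-naive datetime objects so that no
# local-timezone conversion is ever applied
issue_dates = np.array([parse(row[0]) for row in rs], dtype=object)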
Re: [Numpy-discussion] timezones and datetime64
Nathaniel Smith njs at pobox.com writes:

On Wed, Apr 3, 2013 at 2:26 PM, Dave Hirschfeld dave.hirschfeld at gmail.com wrote: This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects.

Wow, that's truly broken. I'm sorry. I'm skeptical that just switching to UTC everywhere is actually the right solution. It smells like one of those solutions that's simple, neat, and wrong. (I don't know anything about calendar-time series handling, so I have no ability to actually judge this stuff, but wouldn't one problem be if you want to know about business days/hours? You lose the original day-of-year once you move everything to UTC.) Maybe datetime dtypes should be parametrized by both granularity and timezone? Or we could just declare that datetime64 is always timezone-naive and adjust the code to match? I'll CC the pandas list in case they have some insight. Unfortunately AFAIK no-one who's regularly working on numpy at this point works with datetimes, so we have limited ability to judge solutions... please help! -n

I think simply setting the timezone to UTC if it's not specified would solve 99% of use cases because IIUC the internal representation is UTC so numpy would be doing no conversion of the dates that were passed in. It was the conversion which was the source of the error in my example. The only potential issue with this is that the dates might take along an incorrect UTC timezone, making it more difficult to work with naive datetimes. e.g.

In [42]: d = np.datetime64('2014-01-01 00:00:00', dtype='M8[ns]')

In [43]: d
Out[43]: numpy.datetime64('2014-01-01T00:00:00+0000')

In [44]: str(d)
Out[44]: '2014-01-01T00:00:00+0000'

In [45]: pydate(str(d))
Out[45]: datetime.datetime(2014, 1, 1, 0, 0, tzinfo=tzutc())

In [46]: pydate(str(d)) == datetime.datetime(2014, 1, 1)
Traceback (most recent call last):
  File "<ipython-input-46-abfc0fee9b97>", line 1, in <module>
    pydate(str(d)) == datetime.datetime(2014, 1, 1)
TypeError: can't compare offset-naive and offset-aware datetimes

In [47]: pydate(str(d)) == datetime.datetime(2014, 1, 1, tzinfo=tzutc())
Out[47]: True

In [48]: pydate(str(d)).replace(tzinfo=None) == datetime.datetime(2014, 1, 1)
Out[48]: True

In this case it may be best to have numpy not try to set the timezone at all if none was specified. Given the internal representation is UTC I'm not sure this is feasible though so defaulting to UTC may be the best solution.

Regards,
Dave
Re: [Numpy-discussion] New numpy functions: filled, filled_like
Robert Kern robert.kern at gmail.com writes:

One alternative that does not expand the API with two-liners is to let the ndarray.fill() method return self: a = np.empty(...).fill(20.0)

This violates the convention that in-place operations never return self, to avoid confusion with out-of-place operations. E.g. ndarray.resize() versus ndarray.reshape(), ndarray.sort() versus np.sort(), and in the broader Python world, list.sort() versus sorted(), list.reverse() versus reversed(). (This was an explicit reason given for list.sort to not return self, even.) Maybe enabling this idiom is a good enough reason to break the convention ("Special cases aren't special enough to break the rules." / "Although practicality beats purity"), but it at least makes me -0 on this...

I tend to agree with the notion that inplace operations shouldn't return self, but I don't know if it's just because I've been conditioned this way. Not returning self breaks the fluid interface pattern [1], as noted in a similar discussion on pandas [2], FWIW, though there's likely some way to have both worlds.

Ah-hah, here's the email where Guido officially proclaims that there shall be no fluent interface nonsense applied to in-place operators in Python, because it hurts readability (at least for Dutch people): http://mail.python.org/pipermail/python-dev/2003-October/038855.html

That's a statement about the policy for the stdlib, and just one person's opinion. You, and numpy, are permitted to have a different opinion. In any case, I'm not strongly advocating for it. Its violation of principle (no fluent interfaces) is roughly in the same ballpark as np.filled() (not every two-liner needs its own function), so I thought I would toss it out there for consideration.

-- Robert Kern

FWIW I'm +1 on the idea. Perhaps because I just don't see many practical downsides to breaking the convention but I regularly see a big issue with there being no way to instantiate an array with a particular value. The one obvious way to do it is use ones and multiply by the value you want. I work with a lot of inexperienced programmers and I see this idiom all the time. It takes a fair amount of numpy knowledge to know that you should do it in two lines by using empty and setting a slice.

In [1]: %timeit NaN*ones(1)
1000 loops, best of 3: 1.74 ms per loop

In [2]: %%timeit
   ...: x = empty(1, dtype=float)
   ...: x[:] = NaN
   ...:
1 loops, best of 3: 28 us per loop

In [3]: 1.74e-3/28e-6
Out[3]: 62.142857142857146

Even when not in the mythical tight loop setting an array to one and then multiplying uses up a lot of cycles - it's nearly 2 orders of magnitude slower than what we know they *should* be doing. I'm agnostic as to whether fill should be modified or new functions provided but I think numpy is currently missing this functionality and that providing it would save a lot of new users from shooting themselves in the foot performance-wise.

-Dave
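For anyone wanting this today, the two-line idiom wraps up into a trivial helper - a sketch of what a filled() function could look like in user code, not the actual proposal:

import numpy as np

def filled(shape, fill_value, dtype=float):
    # allocate uninitialised memory and fill it in place, avoiding the
    # temporary that `fill_value * np.ones(shape)` creates
    a = np.empty(shape, dtype=dtype)
    a.fill(fill_value)
    return a

nan_block = filled((1000, 1000), np.nan)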
Re: [Numpy-discussion] Bug in as_strided/reshape
Sebastian Berg sebastian at sipsolutions.net writes: Hello, looking at the code, when only adding/removing dimensions with size 1, numpy takes a small shortcut, however it uses 0 stride lengths as value for the new one element dimensions temporarily, then replacing it again to ensure the new array is contiguous. This replacing does not check if the dimension has more then size 1. Likely there is a better way to fix it, but the attached diff should do it. Regards, Sebastian Thanks for the confirmation. So this doesn't get lost I've opened issue #380 on GitHub https://github.com/numpy/numpy/issues/380 -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Second try: possible bug in assignment to complex array
Mark Bakker markbak at gmail.com writes: I think there is a problem with assigning a 1D complex array of length one to a position in another complex array. Example: a = ones(1,'D') b = ones(1,'D') a[0] = b --- TypeError Traceback (most recent call last) ipython-input-37-0c4fc6d780e3 in module() 1 a[0] = b TypeError: can't convert complex to float This works correctly when a and b are real arrays: a = ones(1) b = ones(1) a[0] = b Bug or feature? Thanks, Mark I can't help unfortunately, but I can confirm that I also see the problem on Win32 Python 2.7.3, numpy 1.6.2. As a workaround it appears that slicing works: In [15]: sys.version Out[15]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]' In [16]: sys.version Out[16]: '2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]' In [17]: np.__version__ Out[17]: '1.6.2' In [18]: a = ones(1,'D') In [19]: b = 2*ones(1,'D') In [20]: a[0] = b --- TypeError Traceback (most recent call last) ipython-input-20-0c4fc6d780e3 in module() 1 a[0] = b TypeError: can't convert complex to float In [21]: a[0:1] = b In [22]: -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Bug in as_strided/reshape
Dave Hirschfeld dave.hirschfeld at gmail.com writes: It seems that reshape doesn't work correctly on an array which has been resized using the 0-stride trick e.g. In [73]: x = array([5]) In [74]: y = as_strided(x, shape=(10,), strides=(0,)) In [75]: y Out[75]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]) In [76]: y.reshape([10,1]) Out[76]: array([[ 5], [ 8], [ 762933412], [-2013265919], [ 26], [ 64], [ 762933414], [-2013244356], [ 26], [ 64]]) Should all be 5 In [77]: y.copy().reshape([10,1]) Out[77]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [78]: np.__version__ Out[78]: '1.6.2' Perhaps a clause such as below is required in reshape? if any(stride == 0 for stride in y.strides): return y.copy().reshape(shape) else: return y.reshape(shape) Regards, Dave Though it would be good to avoid the copy which you should be able to do in this case. Investigating further: In [15]: y.strides Out[15]: (0,) In [16]: z = y.reshape([10,1]) In [17]: z.strides Out[17]: (4, 4) In [18]: z.strides = (0, 4) In [19]: z Out[19]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [32]: y.reshape([5, 2]) Out[32]: array([[5, 5], [5, 5], [5, 5], [5, 5], [5, 5]]) In [33]: y.reshape([5, 2]).strides Out[33]: (0, 0) So it seems that reshape is incorrectly setting the stride of axis0 to 4, but only when the appended axis is of size 1. -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Bug in as_strided/reshape
It seems that reshape doesn't work correctly on an array which has been resized using the 0-stride trick e.g. In [73]: x = array([5]) In [74]: y = as_strided(x, shape=(10,), strides=(0,)) In [75]: y Out[75]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5]) In [76]: y.reshape([10,1]) Out[76]: array([[ 5], [ 8], [ 762933412], [-2013265919], [ 26], [ 64], [ 762933414], [-2013244356], [ 26], [ 64]]) Should all be 5 In [77]: y.copy().reshape([10,1]) Out[77]: array([[5], [5], [5], [5], [5], [5], [5], [5], [5], [5]]) In [78]: np.__version__ Out[78]: '1.6.2' Perhaps a clause such as below is required in reshape? if any(stride == 0 for stride in y.strides): return y.copy().reshape(shape) else: return y.reshape(shape) Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
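Until the bug is fixed, the clause suggested above can live in a small wrapper in user code - a sketch only:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def safe_reshape(arr, shape):
    # copy first if any stride is zero, since reshape mishandles such views
    if any(stride == 0 for stride in arr.strides):
        return arr.copy().reshape(shape)
    return arr.reshape(shape)

x = np.array([5])
y = as_strided(x, shape=(10,), strides=(0,))
assert (safe_reshape(y, (10, 1)) == 5).all()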
Re: [Numpy-discussion] Moving lib.recfunctions?
Pierre GM pgmdevlist at gmail.com writes:

Hello, The idea behind having a lib.recfunctions and not a rec.recfunctions or whatever was to illustrate that the functions of this package are more generic than they appear. They work with regular structured ndarrays and don't need recarrays. Methinks we gonna lose this aspect if you try to rename it, but hey, your call.

I've never really thought there's much distinction between the two - AFAICT a recarray is just a structured array with attribute access? If a function only accepts a recarray (are there any?) isn't it just a simple call to .view(np.recarray) to get it to work with structured arrays? Because of this view I've always thought functions which worked on either should be grouped together.

As to why they were never really advertised? Because I never received any feedback when I started developing them (developing is a big word here, I just took a lot of code that John D Hunter had developed in matplotlib and make it more consistent). I advertised them once or twice on the list, wrote the basic docstrings, but waited for other people to start using them. Anyhow. So, yes, there might be some weird import to polish. Note that if you decided to just rename the package and leave it where it was, it would probably be easier.

The path of least resistance is to just import lib.recfunctions.* into the (already crowded) main numpy namespace and be done with it.

Why? Why can't you leave it available through numpy.lib? Once again, if it's only a matter of PRing, you could start writing an entry page in the doc describing the functions, that would improve the visibility.

I do recall them being advertised a while ago, but when I came to look for them I couldn't find them - IMHO np.rec is a much more intuitive (and nicer/shorter) namespace than np.lib.recfunctions. I think having similar functionality in two completely different namespaces is confusing and hard to remember. It also doesn't help that np.lib.recfunctions isn't discoverable by tab-completion:

In [2]: np.lib.rec
np.lib.recfromcsv  np.lib.recfromtxt

...of course you could probably find it with np.lookfor but it's one more barrier to their use. FWIW I'd be happy if the np.lib.recfunctions functions were made available in the np.rec namespace (and possibly deprecate np.lib.recfunctions to avoid confusion?) I'm conscious that as a user (not a developer) talk is cheap and I'm happy with whatever the consensus is. I just thought I'd pipe up since it was only through this thread that I re-discovered np.lib.recfunctions!

HTH,
Dave
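The structured-array/recarray equivalence mentioned above really is just a view - a quick illustration (the field names are made up):

import numpy as np

sa = np.zeros(3, dtype=[('widgets', float), ('gadgets', float)])

ra = sa.view(np.recarray)   # same memory, attribute access added
ra.widgets[:] = 1.0
assert (sa['widgets'] == 1.0).all()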
Re: [Numpy-discussion] fixing up datetime
Wes McKinney wesmckinn at gmail.com writes: - Fundamental need to be able to work with multiple time series, especially performing operations involving cross-sectional data - I think it's a bit hard for lay people to use (read: ex-MATLAB/R users). This is just my opinion, but a few years ago I thought about using it and concluded that teaching people how to properly use it (a precision tool, indeed!) was going to cause me grief. - The data alignment problem, best explained in code: - Inability to derive custom offsets: I can do: In [14]: ts.shift(2, offset=2 * datetools.BDay()) Out[14]: 2000-01-11 00:00:00 0.0503706684002 2000-01-18 00:00:00 -1.7660004939 2000-01-25 00:00:00 1.11716758554 2000-02-01 00:00:00 -0.171029995265 2000-02-08 00:00:00 -0.99876580126 2000-02-15 00:00:00 -0.262729046405 or even generate, say, 5-minutely or 10-minutely date ranges thusly: In [16]: DateRange('6/8/2011 5:00', '6/8/2011 12:00', offset=datetools.Minute(5)) Out[16]: class 'pandas.core.daterange.DateRange' offset: 5 Minutes, tzinfo: None [2011-06-08 05:00:00, ..., 2011-06-08 12:00:00] length: 85 It would be nice to have a step argument in the date_array function. The following works though: In [96]: integers = r_[ts.Date('T',01-Aug-2011 05:00).value: ts.Date('T',06-Aug-2011 12:01).value: 5] In [97]: ts.date_array(integers, freq='T') Out[97]: DateArray([01-Aug-2011 05:00, 01-Aug-2011 05:05, 01-Aug-2011 05:10, ..., 06-Aug-2011 11:45, 06-Aug-2011 11:50, 06-Aug-2011 11:55, ..., 06-Aug-2011 12:00], freq='T') - (possible now??) Ability to have a set of frequency-naive dates (possibly not in order). This last point actually matters. Suppose you wanted to get the worst 5-performing days in the SP 500 index: In [7]: spx.index Out[7]: class 'pandas.core.daterange.DateRange' offset: 1 BusinessDay, tzinfo: None [1999-12-31 00:00:00, ..., 2011-05-10 00:00:00] length: 2963 # but this is OK In [8]: spx.order()[:5] Out[8]: 2008-10-15 00:00:00 -0.0903497960942 2008-12-01 00:00:00 -0.0892952780505 2008-09-29 00:00:00 -0.0878970494885 2008-10-09 00:00:00 -0.0761670761671 2008-11-20 00:00:00 -0.0671229140321 - W Seems like you're looking for the tssort function: In [90]: series = ts.time_series(randn(365),start_date='01-Jan-2011',freq='D') In [91]: def tssort(series): : indices = series.argsort() : return ts.time_series(series._series[indices], : series.dates[indices], : autosort=False) : In [92]: tssort(series)[0:5] Out[92]: timeseries([-2.96602612 -2.81612524 -2.61558511 -2.59522921 -2.4899447 ], dates = [26-Aug-2011 18-Apr-2011 27-Aug-2011 21-Aug-2011 19-Nov-2011], freq = D) Pandas seems to offer an even higher level api than scikits.timeseries. I view them as mostly complementary but I haven't (yet) had much experience with pandas... -Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
Mark Wiebe mwwiebe at gmail.com writes: It appears to me that a structured dtype with some further NumPy extensions could entirely replace the 'events' metadata fairly cleanly. If the ufuncs are extended to operate on structured arrays, and integers modulo n are added as a new dtype, a dtype like [('date', 'M8[D]'), ('event', 'i8[mod 100]')] could replace the current 'M8[D]//100'. Sounds like a cleaner API. As Dave H. summarized, we used a basic keyword to do the same thing in scikits.timeseries, with the addition of some subfrequencies like A-SEP to represent a year starting in September, for example. It works, but it's really not elegant a solution. This kind of thing definitely belongs in a layer above datetime. That's fair enough - my perspective as a timeseries user is probably a lot higher level. My main aim is to point out some of the higher level uses so that the numpy dtype can be made compatible with them - I'd hate to have a situation where we have multiple different datetime representations in the different libraries and having to continually convert at the boundaries. That said, I think the starting point for a series at a weekly, quarterly or annual frequency/unit/type is something which may need to be sorted out at the lowest level... One overall impression I have about timeseries in general is the use of the term frequency synonymously with the time unit. To me, a frequency is a numerical quantity with a unit of 1/(time unit), so while it's related to the time unit, naming it the same is something the specific timeseries domain has chosen to do, I think the numpy datetime class shouldn't have anything called frequency in it, and I would like to remove the current usage of that terminology from the codebase. It seems that it's just a naming convention (possibly not the best) and can be used synonymously with the time unit/resolution/dtype I don't envision 'asfreq' being a datetime function, this is the kind of thing that would layer on top in a specialized timeseries library. The behavior of timedelta follows a more physics-like idea with regard to the time unit, and I don't think something more complicated belongs at the bottom layer that is shared among all datetime uses. I think since freq == dtype then asfreq == astype. From your examples it seems to do the same thing - i.e. if you go to a lower resolution (freq) representation the higher resolution information is truncated - e.g. a monthly resolution date has no information about days/hours/minutes/seconds etc. It's converting in the other direction: low -- high resolution where the difference lies - numpy always converts to the start of the interval whereas the timeseries Date class gives you the option of the start or the end. I'm thinking of a datetime as an infinitesimal moment in time, with the unit representing the precision that can be stored. Thus, '2011', '2011-01', and '2011-01-01T00:00:00.00Z' all represent the same moment in time, but with a different unit of precision. Computationally, this perspective is turning out to provide a pretty rich mechanism to do operations on date. I think this is where the main point of difference is. I use the timeseries as a container for data which arrives in a certain interval of time. e.g. monthly temperatures are the average temperature over the interval defined by the particular month. It's not the temperature at the instant of time that the month began, or ended or of any particular instant in that month. 
Thus the conversion from a monthly resolution to say a daily resolution isn't well defined and the right thing to do is likely to be application specific. For the average temperature example you may want to choose a value in the middle of the month so that you don't introduce a phase delay if you interpolate between the datapoints. If for example you measured total rainfall in a month you might want to choose the last day of the month to represent the total rainfall for that month as that was the only date in the interval where the total rainfall did in fact equal the monthly value. It may be as you say though that all this functionality does belong at a higher level... Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] fixing up datetime
As a user of numpy/scipy in finance I thought I would put in my 2p worth as it's something which is of great importance in this area. I'm currently a heavy user of the scikits.timeseries package by Matt Pierre and I'm also following the development of statsmodels and pandas should we require more sophisticated statistics in future. Hopefully the numpy datetime type will provide a foundation such packages can build upon... I'll use the timeseries package for reference since I'm most familiar with it and it's a very good api for my requirements. Apologies to Matt/Pierre if I get anything wrong - feel free to correct my misconceptions... I think some of the complexity is coming from the definition of the timedelta. In the timeseries package each date simply represents the number of periods since the epoch and the difference between dates is therefore just and integer with no attached metadata - its meaning is determined by the context it's used in. e.g. In [56]: M1 = ts.Date('M',01-Jan-2011) In [57]: M2 = ts.Date('M',01-Jan-2012) In [58]: M2 - M1 Out[58]: 12 timeseries gets on just fine without a timedelta type - a timedelta is just an integer and if you add an integer to a date it's interpreted as the number of periods of that dates frequency. From a useability point of view M1 + 1 is much nicer than having to do something like M1 + ts.TimeDelta(M1.freq, 1). Something like the dateutil relativedelta pacage is very convenient and could serve as a template for such functionality: In [59]: from dateutil.relativedelta import relativedelta In [60]: (D1 + 30).datetime Out[60]: datetime.datetime(2011, 1, 31, 0, 0) In [61]: (D1 + 30).datetime + relativedelta(months=1) Out[61]: datetime.datetime(2011, 2, 28, 0, 0) ...but you can still get the same behaviour without a timedelta by asking that the user explicitly specify what they mean by adding one month to a date of a different frequency. e.g. In [62]: (D1 + 30) Out[62]: D : 31-Jan-2011 In [63]: _62.asfreq('M') + 1 Out[63]: M : Feb-2011 In [64]: (_62.asfreq('M') + 1).asfreq('D','END') Out[64]: D : 28-Feb-2011 In [65]: (_62.asfreq('M') + 1).asfreq('D','START') + _62.day Out[65]: D : 04-Mar-2011 As Pierre noted when converting dates from a lower frequency to a higher one it's very useful (essential!) to be able to specify whether you want the end or the start of the interval. 
It may also be useful to be able to specify an arbitrary offset from either the start or the end of the interval so you could do something like: In [66]: (_62.asfreq('M') + 1).asfreq('D', offset=0) Out[66]: D : 01-Feb-2011 In [67]: (_62.asfreq('M') + 1).asfreq('D', offset=-1) Out[67]: D : 28-Feb-2011 In [68]: (_62.asfreq('M') + 1).asfreq('D', offset=15) Out[68]: D : 16-Feb-2011 I don't think it's useful to define higher 'frequencies' as arbitrary multiples of lower 'frequencies' unless the conversion is exact otherwise it leads to the following inconsistencies: In [69]: days_per_month = 30 In [70]: D1 = M1.asfreq('D',relation='START') In [71]: D2 = M2.asfreq('D','START') In [72]: D1, D2 Out[72]: (D : 01-Jan-2011, D : 01-Jan-2012) In [73]: D1 + days_per_month*(M2 - M1) Out[73]: D : 27-Dec-2011 In [74]: D1 + days_per_month*(M2 - M1) == D2 Out[74]: False If I want the number of days between M1 and M2 I explicitely do the conversion myself: In [75]: M2.asfreq('D','START') - M1.asfreq('D','START') Out[75]: 365 thus avoiding any inconsistency: In [76]: D1 + (M2.asfreq('D','START') - M1.asfreq('D','START')) == D2 Out[76]: True I'm not convinced about the events concept - it seems to add complexity for something which could be accomplished better in other ways. A [Y]//4 dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There may well be a good reason for it however I can't see the need for it in my own applications. In the timeseries package, because the difference between dates represents the number of periods between the dates they must be of the same frequency to unambiguopusly define what a period means: In [77]: M1 - D1 --- ValueErrorTraceback (most recent call last) C:\dev\code\ipython console in module() ValueError: Cannot subtract Date objects with different frequencies. I would argue that in the spirit that we're all consenting adults adding dates of the same frequency can be a useful thing for example in finding the mid-point between two dates: In [78]: M1.asfreq('S','START') Out[78]: S : 01-Jan-2011 00:00:00 In [79]: M2.asfreq('S','START') Out[79]: S : 01-Jan-2012 00:00:00 In [80]: ts.Date('S', (_64.value + _65.value)//2) Out[80]: S : 02-Jul-2011 12:00:00 I think any errors which arose from adding or multiplying dates would be pretty easy to spot in your code. As Robert mentioned the economic data we use is often supplied as weekly, monthly, quarterly or annual data. So these frequencies are critical if we're to use the
Re: [Numpy-discussion] fixing up datetime
Robert Kern robert.kern at gmail.com writes: On Tue, Jun 7, 2011 at 07:34, Dave Hirschfeld dave.hirschfeld at gmail.com wrote: I'm not convinced about the events concept - it seems to add complexity for something which could be accomplished better in other ways. A [Y]//4 dtype is better specified as [3M] dtype, a [D]//100 is an [864S]. There may well be a good reason for it however I can't see the need for it in my own applications. Well, [D/100] doesn't represent [864s]. It represents something that happens 100 times a day, but not necessarily at precise regular intervals. For example, suppose that I am representing payments that happen twice a month, say on the 1st and 15th of every month, or the 5th and 20th. I would use [M/2] to represent that. It's not [2W], and it's not [15D]. It's twice a month. The default conversions may seem to imply that [D/100] is equivalent to [864s], but they are not intended to. They are just a starting point for one to write one's own, more specific conversions. Similarly, we have default conversions from low frequencies to high frequencies defaulting to representing the higher precision event at the beginning of the low frequency interval. E.g. for days-seconds, we assume that the day is representing the initial second at midnight of that day. We then use offsets to allow the user to add more information to specify it more precisely. That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occurring at a monthly frequency, but at a higher frequency. In this case the lowest frequency which can capture this data is daily frequency, so it should be stored at daily frequency and if monthly statistics are required the series can be aggregated up after the fact. e.g. In [2]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [3]: dates = dates.asfreq('D','START') In [4]: dates = (dates[:,None] + np.tile(array([4, 19]), [12, 1])).ravel() In [5]: data = 100 + 10*randn(12, 2) In [6]: payments = ts.time_series(data.ravel(), dates) In [7]: payments Out[7]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) In [8]: payments.convert('M', ma.count) Out[8]: timeseries([2 2 2 2 2 2 2 2 2 2 2 2], dates = [Jan-2011 ... Dec-2011], freq = M) In [9]: payments.convert('M', ma.sum) Out[9]: timeseries([205.061556202 193.009420163 191.990027161 188.591898598 208.136579315 190.855511303 197.699151161 178.859030538 188.286603379 199.157536259 180.775564724 211.063462017], dates = [Jan-2011 ... Dec-2011], freq = M) Alternatively, for a fixed number of events per period the values can just be stored in a 2D array - e.g.
In [10]: dates = ts.date_array('01-Jan-2011','01-Dec-2011', freq='M') In [11]: payments = ts.time_series(data, dates) In [12]: payments Out[12]: timeseries( [[ 103.76588849 101.29566771] [ 91.10363573 101.90578443] [ 102.12588909 89.86413807] [ 94.89200485 93.69989375] [ 103.37375202 104.7628273 ] [ 97.45956699 93.39594431] [ 94.79258639 102.90656477] [ 87.42346985 91.43556069] [ 95.21947628 93.0671271 ] [ 107.07400065 92.0835356 ] [ 94.11035154 86.66521318] [ 109.36556861 101.69789341]], dates = [Jan-2011 ... Dec-2011], freq = M) In [13]: payments.sum(1) Out[13]: timeseries([ 205.0615562 193.00942016 191.99002716 188.5918986 208.13657931 190.8555113 197.69915116 178.85903054 188.28660338 199.15753626 180.77556472 211.06346202], dates = [Jan-2011 ... Dec-2011], freq = M) It seems to me that either of these would satisfy the use-case with the added benefit of simplifying the datetime implementation. That said, I'm not against the proposal if it provides someone with some benefit... Regarding the default conversions, the start of the interval is a perfectly acceptable default (and my preferred choice), however being able to specify the end is also useful since, for the month-day conversion, the end of each month can't be specified by a fixed offset from the start because of their varying lengths. Of course this can be found by subtracting 1 from the start of the next month: (M + 1).asfreq('D',offset=0) - 1 but just as it's easier to write List[-1] rather than List[List.length - 1], it's easier to write M.asfreq('D', offset=-1).
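The same end-of-month idiom could be written with datetime64 in much the same way (a sketch only, assuming numpy 1.7+ where the month-to-day conversion gives the start of the interval):

import numpy as np

m = np.datetime64('2011-02', 'M')
(m + 1).astype('M8[D]') - np.timedelta64(1, 'D')   # numpy.datetime64('2011-02-28')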
Re: [Numpy-discussion] fixing up datetime
Christopher Barker Chris.Barker at noaa.gov writes: Dave Hirschfeld wrote: That would be one way of dealing with irregularly spaced data. I would argue that the example is somewhat back-to-front though. If something happens twice a month it's not occurring at a monthly frequency, but at a higher frequency. And that frequency is 2/month. In this case the lowest frequency which can capture this data is daily frequency so it should be stored at daily frequency I don't think it should, because it isn't 1/15 days, or, indeed, a frequency that can be specified as days. Sure you can specify the 5th and 20th of each month in a given time span in terms of days since an epoch, but you've lost some information there. When you want to do math -- like add a certain number of 1/2 months -- when is the 100th payment due? It seems keeping it in M/2 is the most natural way to deal with that -- then you don't need special code to do that addition, only when converting to a string (or other format) datetime. With a monthly frequency you can't represent 2/month except using a 2D array as in my second example. This does however discard the information about when in the month the payments occurred/are due. For that matter I'm not sure where the information that the events occur on the 5th and the 20th is recorded in the dtype [M]//2? Maybe my mental model is too fixed in the scikits.timeseries mode and I need to try out the new code... You're right that 2 per month isn't a daily frequency, however as you noted the data can be stored in a daily frequency array with missing values for each date where a payment doesn't occur. The timeseries package uses a masked array to represent the missing data, however you aren't required to have data for every day - only when you call the .fill_missing_dates() function is the array expanded to include a datapoint for every period. As shown below the timeseries is at a daily frequency but only datapoints for the 5th and 20th of the month are included: In [21]: payments Out[21]: timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.12588909 89.86413807 94.89200485 93.69989375 103.37375202 104.7628273 97.45956699 93.39594431 94.79258639 102.90656477 87.42346985 91.43556069 95.21947628 93.0671271 107.07400065 92.0835356 94.11035154 86.66521318 109.36556861 101.69789341], dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011], freq = D) If I want to see when the 5th payment is due I can simply do: In [26]: payments[4:5] Out[26]: timeseries([ 102.12588909], dates = [05-Mar-2011], freq = D) Advancing the payments by a fixed number of days is possible: In [28]: payments.dates[:] += 3 In [29]: payments.dates Out[29]: DateArray([08-Jan-2011, 23-Jan-2011, 08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011], freq='D') Starting 3 payments in the future is more difficult and would require the date array to be recreated with the new starting date.
One way of doing that would be: In [42]: dates = ts.date_array(payments.dates[2], length=(31*payments.size)//2) In [43]: dates = dates[(dates.day == 8) | (dates.day == 23)][0:payments.size] In [44]: dates Out[44]: DateArray([08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011, 08-Jan-2012, 23-Jan-2012], freq='D') In [45]: dates.shape Out[45]: (24,) A bit messy - I'll have to look at the numpy implementation to see how it improves the situation... Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
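For comparison, the same "shift by three payments" could be expressed with a plain datetime64 array by regenerating the 8th/23rd grid and slicing (a sketch only - it assumes numpy 1.7+ datetime64 and is not the scikits.timeseries API):

import numpy as np

months = np.arange(np.datetime64('2011-01'), np.datetime64('2012-02'))    # 13 months, dtype M8[M]
grid = (months.astype('M8[D]')[:, None] + np.array([7, 22])).ravel()      # the 8th and 23rd of each month
shifted = grid[2:2 + 24]                                                  # 08-Feb-2011 ... 23-Jan-2012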
Re: [Numpy-discussion] Changing the datetime operation unit rules
Mark Wiebe mwwiebe at gmail.com writes: a = np.datetime64('today') a - a.astype('M8[Y]') numpy.timedelta64(157,'D') vs a = np.datetime64('today') a - a.astype('M8[Y]') Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: ufunc subtract cannot use operands with types dtype('datetime64[D]') and dtype('datetime64[Y]') Cheers, Mark I think I like the latter behaviour actually. The first one implicitly assumes I want the yearly frequency converted to the start of the period. I think the user should have to explicitly convert the dates to a common representation before performing arithmetic on them. It seems to me that if you didn't want your data to be silently cast to the start of the period it would be very difficult to determine whether this had happened or not. As an example, consider the pricing of a stock using the Black-Scholes formula. The value is dependent on the time to maturity, which is the expiry date minus the current date. If the expiry date was accidentally specified at (or converted to) a monthly frequency and the current date was at a daily frequency, taking the difference will give you the wrong number of days until expiry, resulting in a subtly different answer - not good. NB: The timeseries Date/DateArray have a day_of_year attribute which is very useful. Regards, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
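To make the maturity hazard concrete, here is a sketch of the kind of error silent casting allows, assuming the permissive behaviour of the first example above (mixed units cast to the finer one; the dates themselves are just made-up placeholders):

import numpy as np

today  = np.datetime64('2011-06-07')   # daily unit
expiry = np.datetime64('2011-09-17')   # daily unit
expiry - today                         # numpy.timedelta64(102,'D') - the correct time to maturity
np.datetime64('2011-09') - today       # a monthly expiry is silently cast to 2011-09-01,
                                       # giving numpy.timedelta64(86,'D') - 16 days short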
Re: [Numpy-discussion] numpy speed question
Jean-Luc Menut jeanluc.menut at free.fr writes: I have a little question about the speed of numpy vs IDL 7.0. Here is the IDL result: % Compiled module: $MAIN$. 2.837 The python code:

from numpy import *
from time import time

time1 = time()
for j in range(10000):
    for i in range(1000):
        a = cos(2*pi*i/100.)
time2 = time()
print time2-time1

result: In [2]: run python_test_speed.py 24.180943 Whilst you've imported everything from numpy you're not really using numpy - you're still using a slow Python (double) loop. The power of numpy comes from vectorising your code - i.e. applying functions to arrays of data. The example below demonstrates an 80-fold increase in speed by vectorising the calculation:

def method1():
    a = empty([1000, 10000])
    for j in range(10000):
        for i in range(1000):
            a[i,j] = cos(2*pi*i/100.)
    return a

def method2():
    ij = np.repeat((2*pi*np.arange(1000)/100.)[:,None], 10000, axis=1)
    return np.cos(ij)

In [46]: timeit method1() 1 loops, best of 3: 47.9 s per loop In [47]: timeit method2() 1 loops, best of 3: 589 ms per loop In [48]: allclose(method1(), method2()) Out[48]: True ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
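A further point worth noting: np.repeat materialises the full array before the cosine is taken, yet the value only depends on i, so computing the cosine on the 1,000 distinct values first and replicating afterwards is cheaper still (a sketch, assuming the array sizes shown above):

import numpy as np

def method3():
    # cosine of the 1,000 distinct values, then replicate across the columns
    col = np.cos(2*np.pi*np.arange(1000)/100.)
    return np.repeat(col[:, None], 10000, axis=1)

allclose(method2(), method3()) should return True.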
Re: [Numpy-discussion] How to import input data to make ndarray for batch processing?
Venkat dvr002 at gmail.com writes: Hi All, I am new to Numpy (also Scipy). I am trying to reshape my text data which is in one single column (10,000 rows). I want the data to be in 100x100 array form. I have many files to convert like this. All of them have file names like 0, 1, 2, ... 500 without any extension. Actually, I renamed the actual files so that I can import them in Matlab for batch processing. Since Matlab is also new for me, I thought I would try Numpy first. Can anybody help me in writing the script to do this for batch processing? Thanks in advance, Venkat In [2]: dummy_data = np.random.randn(100,100) In [3]: dummy_data.shape Out[3]: (100, 100) In [4]: dummy_data.flatten().shape Out[4]: (10000,) In [5]: np.savetxt('dummy_data.txt', dummy_data.flatten()) In [6]: !less dummy_data.txt 2.571031186906808100e-01 1.566790681796508500e+00 -6.846267829937818800e-01 3.271332705287631200e-01 -7.482409829656505600e-02 1.429095432454441600e-01 -6.41694801869400e-01 -5.319842186383831900e-01 -4.047786844569442600e-01 -6.696045994533519300e-01 -4.895085917712171400e-01 6.969419814656118200e-01 6.656815445278644300e-01 7.455135053686292600e-01 ... In [7]: data = np.loadtxt('dummy_data.txt') In [8]: data.shape Out[8]: (10000,) In [9]: data = data.reshape(100, 100) In [10]: data.shape Out[10]: (100, 100) In [11]: np.allclose(dummy_data, data) Out[11]: True HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
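A minimal batch-processing loop along those lines might look like this (a sketch only - the extensionless file names 0 to 500 are taken from the question and the exact range is an assumption):

import numpy as np

arrays = {}
for name in range(501):
    data = np.loadtxt(str(name))        # one column of 10,000 values
    arrays[name] = data.reshape(100, 100)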
Re: [Numpy-discussion] Printing formatted numerical values
pv+numpy at math.duke.edu writes: Hi, what is the best way to print (to a file or to stdout) formatted numerical values? Analogously to C's printf("%d %g", x, y) etc? For stdout you can simply do: In [26]: w, x, y, z = np.random.randint(0,100,4) In [27]: type(w) Out[27]: <type 'numpy.int32'> In [28]: print("%f %g %e %d" % (w, x, y, z)) 68.000000 64 1.300000e+01 57 In [29]: w, x, y, z Out[29]: (68, 64, 13, 57) For a file I would use np.savetxt HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
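np.savetxt accepts the same printf-style codes via its fmt argument, e.g. (a sketch - the file name and values are just placeholders):

import numpy as np

w, x, y, z = np.random.randint(0, 100, 4)
np.savetxt('values.txt', np.column_stack([w, x, y, z]), fmt='%f %g %e %d')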
Re: [Numpy-discussion] Unpleasant behavior with poly1d and numpy scalar multiplication
Charles R Harris charlesr.harris at gmail.com writes: I was also thinking that someone might want to provide a better display at some point, drawing on a canvas, for instance. And what happens when the degree gets up over 100, which is quite reasonable with the Chebyshev polynomials? There may well be better ways to do it but I've found the following function to be quite handy for visualising latex equations:

from matplotlib import interactive, is_interactive
from matplotlib.backends.backend_agg import RendererAgg
from matplotlib.pyplot import figure, figtext, show
from sympy import latex as sympy_latex

def eqview(expr, fontsize=28, dpi=80):
    IS_INTERACTIVE = is_interactive()
    try:
        interactive(False)
        # convert the sympy expression to a mathtext string
        latex = '$%s$' % sympy_latex(expr)
        fig = figure(dpi=dpi, facecolor='w')
        h = figtext(0.5, 0.5, latex,
                    fontsize=fontsize,
                    horizontalalignment='center',
                    verticalalignment='center')
        # size the figure to fit the rendered equation
        bbox = h.get_window_extent(RendererAgg(15, 15, dpi))
        fig.set_size_inches(1.1*bbox.width/dpi, 1.25*bbox.height/dpi)
        show()
    finally:
        interactive(IS_INTERACTIVE)

NB: Sympy provides the latex function to convert the equation objects into latex as well as other ways to display the objects in the sympy.printing module. It shouldn't be too hard to do something similar if someone was so inclined! HTH, Dave ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion