Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
On Tue, Apr 2, 2013 at 9:09 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Tue, Apr 2, 2013 at 7:09 PM, josef.p...@gmail.com wrote: On Tue, Apr 2, 2013 at 5:52 PM, Nathaniel Smith n...@pobox.com wrote: On Tue, Apr 2, 2013 at 10:21 PM, Matthew Brett matthew.br...@gmail.com wrote: This is like observing that if I say go North then it's ambiguous about whether I want you to drive or walk, and concluding that we need new words for the directions depending on what sort of vehicle you use. So go North means drive North, go htuoS means walk North, etc. Totally silly. Makes much more sense to have one set of words for directions, and then make clear from context what the directions are used for -- drive North, walk North. Or iterate C-wards, store F-wards. C and Z mean exactly the same thing -- they describe a way of unraveling a cube into a straight line. The difference is what we do with the resulting straight line. That's why I'm suggesting that the distinction should be made in the name of the argument. Could you unpack that for the 'ravel' docstring? Because these options all refer to the way of unraveling and not the memory layout that results. Z/C/column-major/whatever-you-want-to-call-it is a general strategy for converting between a 1-dim representation and a n-dim representation. In the case of memory storage, the 1-dim representation is the flat space of pointer arithmetic. In the case of ravel, the 1-dim representation is the flat space of a 1-dim indexed array. But the 1-dim-to-n-dim part is the same in both cases. I think that's why you're seeing people baffled by your proposal -- to them the C refers to this general strategy, and what's different is the context where it gets applied. So giving the same strategy two different names is silly; if anything it's the contexts that should have different names. 
And once we get into memory optimization (and avoiding copies and preserving contiguity), it is necessary to keep both orders in mind: is memory order in F, and am I iterating/raveling in F order (or slicing columns)? I think having two separate keywords gives the impression we can choose two different things at the same time. I guess it could not make sense to do this: np.ravel(a, index_order='C', memory_order='F'). It could make sense to do this: np.reshape(a, (3, 4), index_order='F', memory_order='F'), but that just points out the inherent confusion between the uses of 'order', and in this case, the fact that you can only do: np.reshape(a, (3, 4), index_order='F') correctly distinguishes between the meanings. So, if index_order and memory_order are never in the same function, then the context should be enough. It was always enough for me.

np.reshape(a, (3, 4), index_order='F', memory_order='F') really hurts my head because you mix a function that operates on views, indexing and shapes with memory creation (or I have no idea what memory_order should do in this case). np.asarray(a.reshape((3, 4), order='F'), order='F'), or the examples here http://docs.scipy.org/doc/numpy/reference/generated/numpy.asfortranarray.html?highlight=asfortranarray#numpy.asfortranarray http://docs.scipy.org/doc/numpy/reference/generated/numpy.asarray.html keep functions with index_order and functions with memory_order nicely separated. (It might be useful but very confusing to add memory_order to every function that creates a view if possible and a copy if necessary: "If you have to make a copy, then I want F memory order, otherwise give me a view." But I cannot find a candidate function right now, except for ravel and reshape; see the first notes in docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html.)

a day later (haven't changed my mind): isn't specifying index order in the Parameters section enough as an explanation?
something like:

```
def ravel(a, order='C'):
    """
    Parameters
    ----------
    order : {'C', 'F'}, optional
        Index order: how the array is stacked into a 1-d array.
        'F' means we stack by columns (Fortran order, first index
        varies fastest); 'C' means we stack by rows (C order, last
        index varies fastest).
    """
```

most array *creation* functions explicitly mention memory layout in the docstring

Josef

Best, Matthew

___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
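A minimal sketch of the two index orders being described (plain numpy, nothing beyond what the thread already discusses): 'C' stacks by rows (last index varies fastest), 'F' stacks by columns (first index varies fastest), and neither says anything about the memory layout of the result.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

c_order = a.ravel(order='C')   # row by row
f_order = a.ravel(order='F')   # column by column

print(c_order)  # [1 2 3 4]
print(f_order)  # [1 3 2 4]
```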
Re: [Numpy-discussion] timezones and datetime64
Andreas Hilboll lists at hilboll.de writes: I think your point about using current timezone in interpreting user input being dangerous is probably correct --- perhaps UTC all the way would be a safer (and simpler) choice? +1

+10 from me! I've recently come across a bug due to the fact that numpy interprets dates as being in the local timezone. The data comes from a database query where there is no timezone information supplied (and dates are stored as strings). It is assumed that the user doesn't need to know the timezone - i.e. the dates are timezone naive. Working out the correct timezones would be fairly laborious, but whatever the correct timezones are, they're certainly not the timezone the current user happens to find themselves in! e.g.

In [32]: rs = [(u'2000-01-17 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1203),
   ...:       (u'2000-01-26 00:00:00.00', u'2000-02-01', u'2000-02-29', 0.1369),
   ...:       (u'2000-01-18 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1122),
   ...:       (u'2000-02-25 00:00:00.00', u'2000-03-01', u'2000-03-31', 0.1425)]
   ...: dtype = [('issue_date', 'datetime64[ns]'),
   ...:          ('start_date', 'datetime64[D]'),
   ...:          ('end_date', 'datetime64[D]'),
   ...:          ('value', float)]

In [33]: # What I see in London, UK
   ...: recordset = np.array(rs, dtype=dtype)
   ...: df = pd.DataFrame(recordset)
   ...: df = df.set_index('issue_date')
   ...: df
Out[33]:
                     start_date           end_date              value
issue_date
2000-01-17           2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-26           2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-18           2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-25           2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

In [34]: # What my colleague sees in Auckland, NZ
   ...: recordset = np.array(rs, dtype=dtype)
   ...: df = pd.DataFrame(recordset)
   ...: df = df.set_index('issue_date')
   ...: df
Out[34]:
                     start_date           end_date              value
issue_date
2000-01-16 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1203
2000-01-25 11:00:00  2000-02-01 00:00:00  2000-02-29 00:00:00  0.1369
2000-01-17 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1122
2000-02-24 11:00:00  2000-03-01 00:00:00  2000-03-31 00:00:00  0.1425

Oh dear! This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects. Regards, Dave
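A sketch of the workaround Dave describes: skip numpy's datetime64 string parsing entirely, parse the strings into naive datetime objects, and store them in an object array, so no timezone conversion can occur whatever locale the code runs in. (The format string is an assumption matching the sample data above.)

```python
from datetime import datetime
import numpy as np

issue_dates = [u'2000-01-17 00:00:00.00', u'2000-01-26 00:00:00.00']

# object dtype: numpy never touches the datetime values
parsed = np.array(
    [datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f') for s in issue_dates],
    dtype=object)

print(parsed[0])  # 2000-01-17 00:00:00
```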
Re: [Numpy-discussion] timezones and datetime64
On Wed, Apr 3, 2013 at 2:26 PM, Dave Hirschfeld dave.hirschf...@gmail.com wrote: I've recently come across a bug due to the fact that numpy interprets dates as being in the local timezone. [...] This isn't acceptable for my use case (in a multinational company) and I found no reasonable way around it other than bypassing the numpy conversion entirely by setting the dtype to object, manually parsing the strings and creating an array from the list of datetime objects. [example snipped -- see Dave's message above]

Wow, that's truly broken. I'm sorry. I'm skeptical that just switching to UTC everywhere is actually the right solution. It smells like one of those solutions that's simple, neat, and wrong. (I don't know anything about calendar-time series handling, so I have no ability to actually judge this stuff, but wouldn't one problem be if you want to know about business days/hours? You lose the original day-of-year once you move everything to UTC.) Maybe datetime dtypes should be parametrized by both granularity and timezone? Or we could just declare that datetime64 is always timezone-naive and adjust the code to match? I'll CC the pandas list in case they have some insight. Unfortunately AFAIK no-one who's regularly working on numpy at this point works with datetimes, so we have limited ability to judge solutions... please help! -n
Re: [Numpy-discussion] timezones and datetime64
Nathaniel Smith njs at pobox.com writes: [quoted discussion snipped -- see Nathaniel's message above]

I think simply setting the timezone to UTC if it's not specified would solve 99% of use cases because, IIUC, the internal representation is UTC, so numpy would be doing no conversion of the dates that were passed in. It was the conversion which was the source of the error in my example. The only potential issue with this is that the dates might take along an incorrect UTC timezone, making it more difficult to work with naive datetimes. e.g.
In [42]: d = np.datetime64('2014-01-01 00:00:00', dtype='M8[ns]')

In [43]: d
Out[43]: numpy.datetime64('2014-01-01T00:00:00+0000')

In [44]: str(d)
Out[44]: '2014-01-01T00:00:00+0000'

In [45]: pydate(str(d))
Out[45]: datetime.datetime(2014, 1, 1, 0, 0, tzinfo=tzutc())

In [46]: pydate(str(d)) == datetime.datetime(2014, 1, 1)
Traceback (most recent call last):
  File "<ipython-input-46-abfc0fee9b97>", line 1, in <module>
    pydate(str(d)) == datetime.datetime(2014, 1, 1)
TypeError: can't compare offset-naive and offset-aware datetimes

In [47]: pydate(str(d)) == datetime.datetime(2014, 1, 1, tzinfo=tzutc())
Out[47]: True

In [48]: pydate(str(d)).replace(tzinfo=None) == datetime.datetime(2014, 1, 1)
Out[48]: True

In this case it may be best to have numpy not try to set the timezone at all if none was specified. Given that the internal representation is UTC I'm not sure this is feasible though, so defaulting to UTC may be the best solution. Regards, Dave
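The comparison pitfall in the session above, restated in pure stdlib terms: an aware datetime (here tagged UTC) and a naive one never compare equal, which is why a spurious UTC tzinfo makes naive-datetime workflows awkward; stripping the tzinfo, as in In [48], restores comparability.

```python
from datetime import datetime, timezone

aware = datetime(2014, 1, 1, tzinfo=timezone.utc)
naive = datetime(2014, 1, 1)

# equality between aware and naive is always False (ordering raises)
print(aware == naive)                        # False
print(aware.replace(tzinfo=None) == naive)   # True
```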
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
On Wed, Apr 3, 2013 at 6:24 AM, Sebastian Berg sebast...@sipsolutions.net wrote: the context where it gets applied. So giving the same strategy two different names is silly; if anything it's the contexts that should have different names. Yup, that's how I think about it too...

me too... But I would really love if someone would try to make the documentation simpler! yes, I think this is where the solution lies.

There is also never a mention of contiguity, even though when we refer to memory order, having a C/F contiguous array is often the reason why good point -- in fact, I have no idea what would happen in many of these cases for a discontiguous array (or one with arbitrarily weird strides...)

Also 'A' seems often explained not quite correctly (though that does not matter (except for reshape, where its explanation is fuzzy), it will matter more in the future -- even if I don't expect 'A' to be actually used). I wonder about having an 'A' option in reshape at all -- what the heck does it mean? why do we need it? Again, I come back to the fact that memory order is kind-of orthogonal to index order. So for reshape (or ravel, which is really just a special case of reshape...) the 'A' flag and 'K' flag (huh?) are pretty dangerous, and prone to error.

I think of it this way: Much of the beauty of numpy is that it presents a consistent interface to various forms of strided data -- that way, folks can write code that works the same way for any ndarray, while still being able to have internal storage be efficient for the use at hand -- i.e. C order for the common case, Fortran order for interaction with libraries that expect that order (or for algorithms that are more efficient in that order, though that's mostly external libs...), and non-contiguous data so one can work on sub-parts of arrays without copying data around.
In most places, the numpy API hides the internal memory order -- this is a good thing, most people have no need to think about it (or most code, anyway), and you can write code that works (even if not optimally) for any (strided) memory layout. All is good. There are times when you really need to understand, or control, or manipulate the memory layout, to make sure your routines are optimized, or the data is in the right form to pass off to an external lib, or to make sense of raw data read from a file, or... That's what we have .view() and friends for.

However, the 'A' and 'K' flags mix and match these concepts -- and I think that's dangerous. It would be easy for a user to use the 'A' flag, and have everything work fine and dandy with all their test cases, only to have it blow up when someone passes in a different-than-expected array. So really, they should only be used in cases where the code has checked memory order beforehand, or in a really well-defined interface where you know exactly what you're getting. In those cases, it makes the code far clearer and less error-prone to do your rearranging of the memory in a separate step, rather than built in to a ravel() or reshape() call.

[note] -- I wrote earlier that I wasn't confused by the ravel() examples -- true for the 'C' and 'F' flags, but I'm still not at all clear what 'A' and 'K' would give me -- particularly for 'A' and reshape().

So I think the cause of the confusion here is not that we use order in two different contexts, nor the fact that 'C' and 'F' may not mean anything to some people, but that we are conflating two different processes in one function, and with one flag. My (maybe) proposal: we deprecate the 'A' and 'K' flags in ravel() and reshape() (maybe even deprecate ravel() -- does it add anything to reshape?). If not deprecate, at least encourage people in the docs not to use them, and rather do their memory-structure manipulations with .view or stride manipulation, or...
I'm still trying to figure out when you'd want the 'A' flag -- it seems at the end of your operation you will want: the resulting array to be a particular shape, with the elements in a particular order, and you _may_ want the in-memory layout a certain way. But 'A' can't ensure both of those.

-Chris

--
Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
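The "separate step" Chris advocates can be sketched directly: memory layout is a queryable, convertible property of an array, independent of any index-order argument, so reshaping and converting layout can be done as two explicit operations.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous by default
f = np.asfortranarray(a)         # same values, Fortran memory layout

# same elements, different (explicit) memory layouts
print(a.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True
```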
Re: [Numpy-discussion] timezones and datetime64
dave.hirschf...@gmail.com wrote: I found no reasonable way around it other than bypassing the numpy conversion entirely Exactly - we have come to the same conclusion. By the way, it's also inconsistent -- an ISO string without a TZ is interpreted to mean "use the locale", but a datetime object without a TZ is interpreted as UTC, so you get this:

In [68]: dt
Out[68]: datetime.datetime(2013, 4, 3, 12, 0)

In [69]: np.datetime64(dt)
Out[69]: numpy.datetime64('2013-04-03T05:00:00.00-0700')

In [70]: np.datetime64(dt.isoformat())
Out[70]: numpy.datetime64('2013-04-03T12:00:00-0700')

two different results! (and as it happens, datetime.datetime does not have an ISO string parser, so it's not completely trivial to round-trip through that...)

On Wed, Apr 3, 2013 at 6:49 AM, Nathaniel Smith n...@pobox.com wrote: Wow, that's truly broken. I'm sorry. Did you put this in? break out the pitchforks! ( ;-) ) I'm skeptical that just switching to UTC everywhere is actually the right solution. It smells like one of those solutions that's simple, neat, and wrong. well, actually, I don't think UTC everywhere is quite what's proposed -- really it's naive datetimes -- it would be up to the user/application to make sure the time zones are consistent. Which does mean that parsing an ISO string with a timezone becomes problematic... (I don't know anything about calendar-time series handling, so I have no ability to actually judge this stuff, but wouldn't one problem be if you want to know about business days/hours? right -- then you'd want to use local time, so numpy might think it's ISO, but it'd actually be local time. Anyway, at the moment, I don't think datetime64 does this right anyway. I don't see mention of the timezone in the busday functions.
I haven't checked to see if they use the locale TZ or ignore it, but either way is wrong (actually, using the locale setting is worse...) Maybe datetime dtypes should be parametrized by both granularity and timezone? That may be a good option. However, I suspect it's pretty hard to actually use the timezone correctly and consistently, so I'm nervous about that. In any case, we'd need to make sure that the user could specify timezone on I/O and busday calculations, etc, and *never* assume the locale TZ (or anything else about locale) unless asked for. Using the locale TZ is almost never the right thing to do for the kind of applications numpy is used for. Or we could just declare that datetime64 is always timezone-naive and adjust the code to match? That would be the easy way to handle it -- from the numpy side, anyway. I'll CC the pandas list in case they have some insight. I suspect pandas has their own way of dealing with all these issues already. Which makes me think that numpy should take the same approach as the python stdlib: provide a core datatype, but leave the use-case-specific stuff for others to build on. For instance, it seems really odd to have the busday* functions in core numpy... Unfortunately AFAIK no-one who's regularly working on numpy at this point works with datetimes, so we have limited ability to judge solutions... well, that explains how this happened! please help! in 1.7, it is still listed as experimental, so you could say this is all going as planned: release something we can try to use, and see what we find out when using it!
I _think_ one reasonable option may be:

1) Internal is UTC
2) On input:
   a) Default for no-time-zone-specified is UTC (both from datetime objects and ISO strings)
   b) respect TZ if given, converting to UTC
3) On output:
   a) default to UTC
   b) provide a way for the user to specify the timezone desired (perhaps a TZ attribute somewhere, or functions to specifically convert to ISO strings and datetime objects that take an optional TZ parameter)
4) busday* and the like allow a way to specify TZ

Issues I immediately see with this: Respecting the TZ on output is a problem because:

1) if people want naive datetimes, they will get UTC ISO strings, i.e.: '2013-04-03T05:00:00Z' rather than '2013-04-03T05:00:00' - so there should be a way to specify naive or None as a timezone.
2) the python datetime module doesn't have any tzinfo objects built in -- so to respect timezones, numpy would need to maintain its own, or depend on pytz.

Given all this, maybe naive is the way to go, perhaps mirroring datetime.datetime and having an optional tzinfo object attribute. (By the way, I'm confused where that would live -- in the dtype instance? in the array?)

Issue with naive: what do you do with an ISO string that specifies a TZ offset? I'm beginning to see why datetime doesn't support reading ISO strings -- it would need to deal with timezones in that case!

Another note about Timezones and ISO
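Input rule 2 above can be sketched as a small helper (hypothetical, not a numpy API): treat naive input as already being UTC, and convert aware input to naive UTC for internal storage.

```python
from datetime import datetime, timedelta, timezone

def to_naive_utc(dt):
    """Convert a datetime to naive UTC: aware values are shifted to
    UTC and stripped of tzinfo; naive values pass through unchanged."""
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt

pacific = timezone(timedelta(hours=-7))
print(to_naive_utc(datetime(2013, 4, 3, 12, tzinfo=pacific)))
# 2013-04-03 19:00:00
```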
Re: [Numpy-discussion] timezones and datetime64
Mark Wiebe and I are both still tracking NumPy development and can provide context and even help when needed. Apologies if we've left a different impression. We have to be prudent about the time we spend as we have other projects we are pursuing as well, but we help clients with NumPy issues all the time and are eager to continue to improve the code base.

It seems to me that the biggest issue is just the automatic conversion that is occurring on string or date-time input. We should stop using the local time-zone (explicit is better than implicit strikes again) and not use any time-zone unless time-zone information is provided in the string. I am definitely +1 on that. It may be necessary to carry around another flag in the data-type to indicate whether or not the date-time is naive (not time-zone aware) or time-zone aware, so that string printing does not print a time-zone if it didn't have one to begin with as well. If others agree that this is the best way forward, then Mark or I can definitely help contribute a patch.

Best,

-Travis

On Wed, Apr 3, 2013 at 9:38 AM, Dave Hirschfeld dave.hirschf...@gmail.com wrote: [quoted discussion snipped -- see the messages above]

--
---
Travis Oliphant Continuum Analytics, Inc.
http://www.continuum.io
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
Hi, On Wed, Apr 3, 2013 at 5:19 AM, josef.p...@gmail.com wrote: [earlier quoted discussion snipped -- see the messages above]

So, if index_order and memory_order are never in the same function, then the context should be enough. It was always enough for me.

It was not enough for me or the three others who will publicly admit to the shame of finding it confusing without further thought. Again, I just can't see a reason not to separate these ideas. We are not arguing about backwards compatibility here, only about clarity. I guess you do accept that some people, other than yourself, might be less likely to get tripped up by: np.reshape(a, (3, 4), index_order='F') than np.reshape(a, (3, 4), order='F') ?

np.reshape(a, (3, 4), index_order='F', memory_order='F') really hurts my head because you mix a function that operates on views, indexing and shapes with memory creation, (or I have no idea what memory_order should do in this case).

Right. I think you may now be close to my own discomfort when faced with working out (fast) what: np.reshape(a, (3, 4), order='F') means, given 'order' means two different things, and both might be relevant here. Or are you saying that my brain should have quickly calculated that 'order' would be difficult to understand as memory layout and therefore rejected that and seen immediately that index order was the meaning?
Speaking as a psychologist, I don't think that's the way it works. Cheers, Matthew
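The claim under discussion can be made concrete: the order argument to reshape is pure index order, so the result is identical whether the input array happens to be C- or Fortran-ordered in memory.

```python
import numpy as np

a_c = np.arange(12).reshape(3, 4)   # C memory layout
a_f = np.asfortranarray(a_c)        # F memory layout, same values

r_from_c = np.reshape(a_c, (4, 3), order='F')
r_from_f = np.reshape(a_f, (4, 3), order='F')

# identical results regardless of the input's memory layout
print((r_from_c == r_from_f).all())  # True
```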
[Numpy-discussion] try to solve issue #2649 and revisit #473
Hello, all. I'm trying to solve issue 2649, which is related to 473, on multiplication of a matrix and an array. As 2649 shows:

    import numpy as np
    x = np.arange(5)
    I = np.asmatrix(np.identity(5))
    print np.dot(I, x).shape  # -> (1, 5)

First of all I assume we expect that I.dot(x) and I * x behave the same, so I suggest adding a dot function to matrix, like:

    def dot(self, other):
        return self * other

Then the major issue is that the constructors of array and matrix interpret a list differently: array([0, 1]).shape == (2,) while matrix([0, 1]).shape == (1, 2). It goes wrong when you run np.dot(I, x), because in __mul__, x will be converted to a 1*5 matrix first. It's not consistent with np.dot(np.identity(5), x), which returns x. To fix that, I suggest checking the dimension of the array when converting it to a matrix. If it's a 1-d array, then convert it to a vertical (column) vector explicitly, like this:

      if isinstance(data, N.ndarray):
    +     if len(data.shape) == 1:
    +         data = data.reshape(data.shape[0], 1)
          if dtype is None:
              intype = data.dtype
          else:

Any comments?

--
Kan Huang Department of Applied Math & Statistics Stony Brook University 917-767-8018
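The shape behaviour at issue, shown with plain ndarrays for contrast: dot with a 2-d identity preserves a 1-d operand's shape, whereas np.matrix coerces 1-d input to a 1x5 row matrix, which is why the matrix version in the report comes out as (1, 5).

```python
import numpy as np

x = np.arange(5)
I = np.identity(5)

result = np.dot(I, x)   # plain ndarrays: 1-d in, 1-d out
print(result.shape)     # (5,)
```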
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
Hi, On Wed, Apr 3, 2013 at 8:52 AM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: On Wed, Apr 3, 2013 at 6:24 AM, Sebastian Berg sebast...@sipsolutions.net wrote: the context where it gets applied. So giving the same strategy two different names is silly; if anything it's the contexts that should have different names. Yup, that's how I think about it too... me too... But I would really love it if someone would try to make the documentation simpler! yes, I think this is where the solution lies. No question that better docs would be an improvement; let's all agree on that. We all agree that 'order' is used with two different and orthogonal meanings in numpy. I think we are now more or less agreeing that: np.reshape(a, (3, 4), index_order='F') is at least as clear as: np.reshape(a, (3, 4), order='F') Do I have that right so far? Cheers, Matthew
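The two orthogonal meanings can be put side by side in a small sketch: 'C' and 'F' pick the index (unraveling) order, while 'K' consults the array's memory layout:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)     # C-contiguous by default
# Index order: 'C' varies the last axis fastest, 'F' the first axis fastest.
print(np.ravel(a, order='C'))      # [0 1 2 3 4 5]
print(np.ravel(a, order='F'))      # [0 3 1 4 2 5]

# Memory order is the orthogonal idea: 'K' reads elements as laid out in memory.
b = np.asfortranarray(a)           # same values, Fortran memory layout
print(np.ravel(b, order='K'))      # [0 3 1 4 2 5]
```

Note that the 'F' result for `a` and the 'K' result for `b` coincide only because `b` happens to be stored in Fortran order; the two keywords are answering different questions.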
Re: [Numpy-discussion] try to solve issue #2649 and revisit #473
On 4/3/2013 2:44 PM, huangkan...@gmail.com wrote: I suggest add function dot to matrix import numpy as np; x = np.arange(5); I = np.asmatrix(np.identity(5)); I.dot(x) matrix([[ 0., 1., 2., 3., 4.]]) Alan Isaac
Re: [Numpy-discussion] timezones and datetime64
On Wed, Apr 3, 2013 at 9:33 AM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: dave.hirschf...@gmail.com wrote: I found no reasonable way around it other than bypassing the numpy conversion entirely Exactly - we have come to the same conclusion. By the way, it's also inconsistent -- an ISO string without a TZ is interpreted to mean use the locale, but a datetime object without a TZ is interpreted as UTC, so you get this: In [68]: dt Out[68]: datetime.datetime(2013, 4, 3, 12, 0) In [69]: np.datetime64(dt) Out[69]: numpy.datetime64('2013-04-03T05:00:00.00-0700') In [70]: np.datetime64(dt.isoformat()) Out[70]: numpy.datetime64('2013-04-03T12:00:00-0700') two different results! (and as it happens, datetime.datetime does not have an ISO string parser, so it's not completely trivial to round-trip through that...) On Wed, Apr 3, 2013 at 6:49 AM, Nathaniel Smith n...@pobox.com wrote: Wow, that's truly broken. I'm sorry. Did you put this in? break out the pitchforks! ( ;-) ) Many of the aspects of how datetime64 works are from me. I started out from the datetime64 NEP, but it wasn't fleshed out enough so I had to fill in lots of details. I guess your pitchforks are pointing at me. ;) For the way this specific part of the code is, I think it's hard not to have it broken one way or another, no matter how we do it. One thing I observed is that printing the current time is weird if you're looking at it interactively. In general, if you get the current time and print it in UTC, it's the wrong time unless you're in UTC. Python's datetime doesn't help the situation by having datetime.now() return a 'local' time. 
In [1]: import numpy as np In [2]: from datetime import datetime In [3]: np.datetime64('now') Out[3]: numpy.datetime64('2013-04-03T12:17:58-0700') In [4]: np.datetime_as_string(np.datetime64('now'), timezone='UTC') Out[4]: '2013-04-03T19:17:59Z' In [5]: datetime.now() Out[5]: datetime.datetime(2013, 4, 3, 12, 18, 2, 582000) In [6]: datetime.now().isoformat() Out[6]: '2013-04-03T12:18:06.796000' In [7]: np.datetime64(datetime.now()) Out[7]: numpy.datetime64('2013-04-03T05:18:15.525000-0700') In [8]: np.datetime64(datetime.now().isoformat()) Out[8]: numpy.datetime64('2013-04-03T12:18:25.291000-0700') I'm skeptical that just switching to UTC everywhere is actually the right solution. It smells like one of those solutions that's simple, neat, and wrong. well, actually, I don't think UTC everywhere is quite what's proposed -- really it's naive datetimes -- it would be up to the user/application to make sure the time zones are consistent. It seems to me that adding a time zone to the datetime64 metadata might be a good idea, and then allowing it to be None to behave like Python's naive datetimes. This wouldn't be a trivial addition, though. Using Python's timezone object doesn't seem like a good idea, because it would require things to be converted to/from Python's datetime every time they are processed, which would remove the performance benefits of NumPy. The boost datetime library has a nice timezone object which could be used as inspiration for an equivalent in NumPy, but I think any way we cut it, it would be a lot of work. Which does mean that parsing an ISO string with a timezone becomes problematic... Yeah, there are a number of cases. How would it transform '2013-04-03T12:18' to a datetime64 with a timezone by default? I guess that would probably be to use the datetime64's metadata. How would it transform '2013-04-03T12:18Z' or '2013-04-03T12:18-0700' to a datetime64 with no timezone? 
Do we throw an error in the default conversion, and have a separate parsing function that allows more control? (I don't know anything about calendar-time series handling, so I have no ability to actually judge this stuff, but wouldn't one problem be if you want to know about business days/hours?) right -- then you'd want to use local time, so numpy might think it's ISO, but it'd actually be local time. Anyway, at the moment, I don't think datetime64 does this right anyway. I don't see mention of the timezone in the busday functions. I haven't checked to see if they use the locale TZ or ignore it, but either way is wrong (actually, using the locale setting is worse...) The busday functions just operate on datetime64[D]. There is no timezone interaction there, except for how a datetime with a date unit converts to/from a datetime which includes time. Maybe datetime dtypes should be parametrized by both granularity and timezone? That may be a good option. However, I suspect it's pretty hard to actually use the timezone correctly and consistently, so I'm nervous about that. In any case, we'd need to make sure that the user could specify timezone on I/O and
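The naive-local versus explicit-UTC distinction in Python's own datetime, which drives much of the confusion above, can be sketched with the stdlib alone:

```python
from datetime import datetime, timezone

naive = datetime.now()               # local wall-clock time, no tzinfo attached
aware = datetime.now(timezone.utc)   # explicit UTC, carries its tzinfo

print(naive.tzinfo)  # None
print(aware.tzinfo)  # UTC
# A naive datetime gives no way to tell which zone it was recorded in,
# so interpreting it (as UTC? as local?) is a policy decision -- which is
# exactly the decision datetime64 has to make when converting.
```

Any consumer of naive datetimes (numpy included) must pick one interpretation, and whichever it picks will look wrong to users expecting the other.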
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
On Wed, Apr 3, 2013 at 11:39 AM, Matthew Brett matthew.br...@gmail.com wrote: It was not enough for me or the three others who will publicly admit to the shame of finding it confusing without further thought. I would submit that some of the confusion came from the fact that with ravel(), and the 'A' and 'K' flags, you are forced to figure out BOTH index_order and memory_order -- with one flag -- I know I'm still not clear what I'd get in complex situations. Again, I just can't see a reason not to separate these ideas. I agree, but really separating them means ideally having a given function only deal with one or the other, not both at once. We are not arguing about backwards compatibility here, only about clarity. while it could be changed while strictly maintaining backward compatibility -- it is a change that would need to filter through the docs, examples, random blog posts, stack-overflow questions, etc. Is that worth it? I'm not convinced. Right. I think you may now be close to my own discomfort when faced with working out (fast) what: np.reshape(a, (3,4), order='F') I still think it's because you know too much ;-) -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
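The 'A' flag is exactly where the two orders interact: it chooses the unraveling strategy from the array's memory layout. A small sketch:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # C-contiguous
f = np.asfortranarray(a)          # same values, Fortran-contiguous

# 'A' means: use F index order if the array is Fortran-contiguous, else C.
print(np.ravel(a, order='A'))     # [0 1 2 3 4 5]  (C layout -> C order)
print(np.ravel(f, order='A'))     # [0 3 1 4 2 5]  (F layout -> F order)
```

So with 'A' the result depends on a property of the input you may not be able to see at the call site, which is the "complex situations" worry above.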
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
On Wed, Apr 3, 2013 at 11:52 PM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: while it could be changed while strictly maintaining backward compatibility -- it is a change that would need to filter through the docs, examples, random blog posts, stack-overflow questions, etc. Not only that, we would then also be in the situation of having `order` *and* `xxx_order` keywords. This is also confusing, at least as much as the current situation imho. Ralf
Re: [Numpy-discussion] Please stop bottom posting!!
On Wed, Apr 3, 2013 at 11:00 PM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: Best of all is intelligent editing of the thread so far -- edit it down to the key points you are commenting on, and intersperse your comments. That way your email stands on its own as meaningful, but there is not a big pile of left over crap to wade through to read your fabulous pithy opinions Traditionally this is what the phrase bottom posting meant, as a term of art, and is the key reason why those old netiquette guides recommend it. I guess the unexpressed nuances of such definitions get lost over time as people encounter them without the relevant context, though -- sort of like how the full in-context meaning of order= gets lost ;-). -n
Re: [Numpy-discussion] try to solve issue #2649 and revisit #473
On Wed, Apr 3, 2013 at 1:03 PM, Alan G Isaac alan.is...@gmail.com wrote: On 4/3/2013 3:18 PM, huangkan...@gmail.com wrote: In my view, the result should be a 1d array, the same as I.A.dot(x). But the maintainers wanted operations with matrices to return matrices whenever possible. So instead of returning x it returns np.matrix(x). the matrix object is a fine idea, but the key problem is that it provides a 2-d matrix, but no concept of a 1-d vector. I think it would all be cleaner if there were row-vector and column-vector objects to accompany matrix -- then things that naturally return a vector could do so. You can't use a regular 1-d array because there is no way to distinguish between a row or column version. But as Alan said, this was all hashed out a few years back -- a bunch of great ideas, but no one to implement them. The truth is that matrix has little value outside of teaching, so no one with the skills to push it forward uses it themselves. -Chris
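The missing 1-d vector concept is easy to demonstrate: a plain ndarray result stays 1-d with no orientation, while matrix forces an explicit row-or-column choice. A sketch:

```python
import numpy as np

x = np.arange(5)
A = np.identity(5)

# ndarray: a 1-d vector is just 1-d; no row/column distinction exists.
print(A.dot(x).shape)                # (5,)

# matrix: everything is 2-d, so you must pick an orientation yourself.
M = np.asmatrix(A)
print((M * np.asmatrix(x).T).shape)  # (5, 1)  explicit column vector
print((np.asmatrix(x) * M).shape)    # (1, 5)  explicit row vector
```

With dedicated row-vector and column-vector types, expressions like these could keep their vector-ness instead of round-tripping through 1xN or Nx1 matrices.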
Re: [Numpy-discussion] Please stop bottom posting!!
On 04/03/2013 08:06 PM, Charles R Harris wrote: snip Nice editing! ;) Steve
Re: [Numpy-discussion] Raveling, reshape order keyword unnecessarily confuses index and memory ordering
On Wed, Apr 3, 2013 at 9:13 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Apr 3, 2013 at 11:44 AM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Wed, Apr 3, 2013 at 8:52 AM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: On Wed, Apr 3, 2013 at 6:24 AM, Sebastian Berg sebast...@sipsolutions.net wrote: the context where it gets applied. So giving the same strategy two different names is silly; if anything it's the contexts that should have different names. Yup, that's how I think about it too... me too... But I would really love it if someone would try to make the documentation simpler! yes, I think this is where the solution lies. No question that better docs would be an improvement; let's all agree on that. We all agree that 'order' is used with two different and orthogonal meanings in numpy. I think we are now more or less agreeing that: np.reshape(a, (3, 4), index_order='F') is at least as clear as: np.reshape(a, (3, 4), order='F') I believe our job here is to come to some consensus. In that spirit, I think we do agree on these statements above. Now we have the cost / benefit. Benefit: some people may find it easier to understand numpy when these constructs are separated. Cost: there might be some confusion because we have changed the default keywords. Benefit --- What proportion of people would find it easier to understand with the order constructs separated? Clearly Chris, Josef and Sebastian - you estimate, I think, no change in your understanding, because your understanding was near complete already. At least I, Paul Ivanov, and JB Poline found the current state strikingly confusing. I think we have other votes for that position here. It's difficult to estimate the proportions now because my original email and the subsequent discussion are based on the distinction already being made. So, it is hard for us to be objective about whether a new user is likely to get confused. 
At least it seems reasonable to say that some moderate proportion of users will get confused. In that situation, it seems to me the long-term benefit of separating these ideas is relatively high. The benefit will continue over the long term. Cost --- The ravel docstring would look something like this: index_order : {'C','F', 'A', 'K'}, optional ... This keyword used to be called simply 'order', and you can also use the keyword 'order' to specify index_order (this parameter). The problem would then be that, for a while, there will be older code and docs using 'order' instead of 'index_order'. I think this would not cause much trouble. Reading the docstring will explain the change. The old code will continue to work. This cost will decrease to zero over time. So, if we are planning for the long term for numpy, I believe the benefit of the change considerably outweighs the cost. I'm happy to do the code changes, so that's not an issue. Cheers, Matthew
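The backward-compatible transition described above could keep 'order' as an alias for the new keyword. A minimal sketch of the dispatch logic (the wrapper name `ravel_indexed` is hypothetical, not an actual NumPy API):

```python
import numpy as np

def ravel_indexed(a, index_order=None, order=None):
    """Ravel with the proposed 'index_order' name, accepting the old
    'order' keyword as a compatibility alias (hypothetical sketch)."""
    if index_order is not None and order is not None:
        raise TypeError("pass either 'index_order' or 'order', not both")
    if index_order is None:
        index_order = 'C' if order is None else order
    return np.ravel(a, order=index_order)

a = np.arange(6).reshape(2, 3)
print(ravel_indexed(a, index_order='F'))  # [0 3 1 4 2 5]
print(ravel_indexed(a, order='F'))        # same result via the old spelling
```

Old call sites keep working while new code and docs can use the clearer name, which is the "cost decreases to zero over time" path.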
[Numpy-discussion] Moving linalg c code
Hi All, There is a PR https://github.com/numpy/numpy/pull/2954 that adds some blas and lapack functions to numpy. I'm thinking that if that PR is merged it would be good to move all of the blas and lapack functions, including the current ones in numpy/linalg into a single directory somewhere in numpy/core/src. So there are two questions here: should we be adding the new functions, and if so, should we consolidate all the blas and lapack C code into its own directory somewhere in numpy/core/src. Thoughts? Chuck
Re: [Numpy-discussion] timezones and datetime64
On Wed, Apr 3, 2013 at 7:52 PM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: Personally, I never need finer resolution than seconds, nor more than a century, so it's no big deal to me, but just wondering A use case for finer resolution than seconds (in our field, no less!) is lightning data. At the last SciPy conference, a fellow meteorologist mentioned how difficult it was to plot lightning data at resolutions finer than microseconds (which is the resolution of the python datetime objects). Matplotlib does not support the datetime64 object yet (John passed before he could write up that patch). Cheers! Ben By the way, my 12th Rule of Programming is Never roll your own datetime
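The resolution gap is concrete: Python's datetime bottoms out at microseconds, while datetime64 is parametrized down to much finer units (nanoseconds shown; a small sketch):

```python
import numpy as np
from datetime import datetime

# datetime: microseconds are the floor; nothing finer is representable.
d = datetime(2013, 4, 3, 12, 0, 0, 1)
print(d.microsecond)  # 1

# datetime64[ns]: two times one nanosecond apart are distinct and comparable.
t0 = np.datetime64('2013-04-03T12:00:00.000000000', 'ns')
t1 = np.datetime64('2013-04-03T12:00:00.000000001', 'ns')
print(t1 - t0)  # one nanosecond as a timedelta64
```

For lightning data sampled below the microsecond scale, only the datetime64 side of this pair can hold the timestamps without rounding.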
Re: [Numpy-discussion] timezones and datetime64
On 4/3/13, Benjamin Root ben.r...@ou.edu wrote: By the way, my 12th Rule of Programming is Never roll your own datetime A rule on par with never get involved in a land war in Asia: both equally Fraught With Peril. :) Warren