Re: [Numpy-discussion] Add sliding_window_view method to numpy

2020-11-06 Thread Zimmermann Klaus
Hi all,


I have absolutely no problem keeping this out of the main namespace.

In fact I'd like to point out that it was not my idea. Rather, it was
proposed by Bas van Beek in the comments [1,2] and received a little
more scrutiny from Eric Wieser in [3].

The reason that it didn't receive the scrutiny it probably deserves is
that it got a bit mangled up with the array dispatch discussion; sorry
for that.

On the subject matter, I am also curious about the potential for
confusion. What other behavior could one expect from a sliding window
view with this shape?

As I said, I am completely fine with keeping this out of the main
namespace, but I agree with Sebastian's comment, that
`np.lib.stride_tricks` is perhaps not the best namespace. The reason
from my point of view is that stride tricks is really a technical (and
slightly ominous) name that might throw of more application oriented
programmers from finding and using this function. Thinking of my
scientist colleagues, I think those are exactly the kind of users that
could benefit from such a prototyping tool.

Cheers
Klaus



[1] https://github.com/numpy/numpy/pull/17394#issuecomment-700998618
[2] https://github.com/numpy/numpy/pull/17394#discussion_r498215468
[3] https://github.com/numpy/numpy/pull/17394#discussion_r498724340

On 06/11/2020 01:39, Sebastian Berg wrote:
> On Thu, 2020-11-05 at 17:35 -0600, Sebastian Berg wrote:
>> On Thu, 2020-11-05 at 12:51 -0800, Stephan Hoyer wrote:
>>> On Thu, Nov 5, 2020 at 11:16 AM Ralf Gommers <
>>> ralf.gomm...@gmail.com>
>>> wrote:
>>>
 On Thu, Nov 5, 2020 at 4:56 PM Sebastian Berg <
 sebast...@sipsolutions.net>
 wrote:

> Hi all,
>
> just a brief note that I merged this proposal:
>
> https://github.com/numpy/numpy/pull/17394
>
> adding `np.sliding_window_view` into the 1.20 release of NumPy.
>
> There was only one public API change, and that is that the
> `shape`
> argument is now called `window_shape`.
>
> This is still a good time for feedback in case you have a
> better
> idea
> e.g. for the function or parameter names.
>

 The old PR had this in the lib.stride_tricks namespace. Seeing it
 in the
 main namespace is unexpected and likely will lead to
 issues/questions,
 given that such an overlapping view is going to do behave in ways
 the
 average user will be surprised by. It may also lead to requests
 for
 other
 array/tensor libraries to implement this. I don't see any
 discussion on
 this in PR 17394, it looks like a decision by the PR author that
 no
 one
 commented on - reconsider that?

 Cheers,
 Ralf

>>>
>>> +1 let's keep this in the lib.stride_tricks namespace.
>>>
>>
>> I have no reservations against having it in the main namespace and am
>> happy either way (it can still be exposed later in any case). It is
>> the
>> conservative choice and maybe it is an uncommon enough function that
>> it
>> deserves being a bit hidden...
> 
> 
> In any case, its the safe bet for NumPy 1.20 at least so I opened a PR:
> 
> https://github.com/numpy/numpy/pull/17720
> 
> Name changes, etc. are also possible of course.
> 
> I still think it might be nice to find a better place for this type of
> function that `np.lib.stride_tricks` though, but dunno...
> 
> - Sebastian
> 
> 
> 
>>
>> But I am curious, it sounds like you have both very strong
>> reservations, and I would like to understand them better.
>>
>> The behaviour can be surprising, but that is why the default is a
>> read-
>> only view.  I do not think it is worse than `np.broadcast_to` in this
>> regard. (It is nowhere near as dangerous as `as_strided`.)
>>
>> It is true that it is specific to NumPy (memory model). So that is
>> maybe a good enough reason right now.  But I am not sure that
>> stuffing
>> things into a pretty hidden `np.lib.*` namespaces is a great long
>> term
>> solution either. There is very little useful functionality hidden
>> away
>> in `np.lib.*` currently.
>>
>> Cheers,
>>
>> Sebastian
>>


> Cheers,
>
> Sebastian
>
>
>
> On Mon, 2020-10-12 at 08:39 +, Zimmermann Klaus wrote:
>> Hello,
>>
>> I would like to draw the attention of this list to PR #17394
>> [1] that
>> adds the implementation of a sliding window view to numpy.
>>
>> Having a sliding window view in numpy is a longstanding open
>> issue
>> (cf
>> #7753 [2] from 2016). A brief summary of the discussions
>> surrounding
>> it
>> can be found in the description of the PR.
>>
>> This PR implements a sliding window view based on stride
>> tricks.
>> Following the discussion in issue #7753, a first
>> implementation
>> was
>> provided by Fanjin Zeng in PR #10771. After some discussion,
>> that PR
>> stalled and I picked up the issue in the present PR #17394.
>> It
>> is

Re: [Numpy-discussion] datetime64: Remove deprecation warning when constructing with timezone

2020-11-06 Thread Noam Yorav-Raphael
Hi,

I actually arrived at this by first trying to use pandas.Timestamp and
getting very frustrated about it. With pandas, I get:

>>> pd.Timestamp.now()
Timestamp('2020-11-06 09:45:24.249851')

I find the whole notion of a "timezone naive timestamp" to be nearly
meaningless. A timestamp should mean a moment in time (as the current numpy
documentation defines very well). A "naive timestamp" doesn't mean
anything. It's exactly like a "unit naive length". I can have a Length type
which just takes a number, and be very happy that it works both if my "unit
zone" is inches or centimeters. So "Length(3)" will mean 3 cm in most of
the world and 3 inches in the US. But then, if I get "Length(3)" from
someone, I can't be sure what length it refers to.

So currently, this happens with pandas timestamps:

>>> os.environ['TZ'] = 'UTC'; time.tzset()
... t0 = pd.Timestamp.now()
... time.sleep(1)
... os.environ['TZ'] = 'EST-5'; time.tzset()
... t1 = pd.Timestamp.now()
... t1 - t0
Timedelta('0 days 05:00:01.001583')

This is not just theoretical - I actually need to work with data from
several devices, each in its own time zone. And I need to know that I won't
get such meaningless results.

And you can even get something like this:

>>> t0 = pd.Timestamp.now()
... time.sleep(10)
... t1 = pd.Timestamp.now()
... t1 - t0
Timedelta('0 days 01:00:10.001583')

if the first measurement happened to be in winter time and the second
measurement happened to be in daylight saving time.

The solution is simple, and is what datetime64 used to do before the change
- have a type that just represents a moment in time. It's not "in UTC" - it
just stores the number of seconds that passed since an agreed moment in
time (which is usually 1970-01-01 02:00+0200, which is more commonly
referred to as 1970-01-01 00:00Z - it's the exact same moment).

I think it would make things clearer if I'll mention that there are
operations that are not dealing with timestamps. For example, it's
meaningless to ask what is the year of a timestamp - it may depend on the
time zone. These are always *human* related questions, that depend on
certain human conventions. We can call them "calendar questions". For these
types of questions, a type that includes both a timestamp and a timezone
offset (in minutes from UTC) can be useful. Some questions even require
full timezone information, meaning a function that defines what's the
timezone offset for each moment. However, I don't think numpy should deal
with those calendar issues. As a very simple example, even for
"timestamp+offset" types, it's not clear how to compare them - should
values with the same timestamp and different offsets be considered equal or
not? And in virtually all of my data analysis, this calendar aspect has
nothing to do with the questions I'm trying to answer.

I have a suggestion. Instead of changing datetime64 (which I consider to be
ill-defined, but never mind), add a new type called "timestamp64". It will
have the exact same behavior as datetime64 had before the change, except
that its only allowed units will be seconds, milliseconds, microseconds and
nanoseconds.  Removing the longer units will make it clear that it doesn't
deal with calendar and dates. Also, all the business day functionality will
not be applicable to timestamp64. In order to get calendar information
(such as the year) from timestamp64, you will have to manually convert it
to python's datetime (or to np.datetime64) with an explicit timezone (utc,
local, an offset, or a timezone object).

What do you think?

Thanks,
Noam





On Fri, Nov 6, 2020 at 1:45 AM Stephan Hoyer  wrote:

> I can try to dig up the old discussions, but datetime64 used to implement
> both (1) and (3), and this was updated in a very intentional way.
> Datetime64 now works like Python's own time-zone naive datetime.datetime
> objects. The documentation referencing "Z" should be updated -- datetime64
> can be in any timezone you like.
>
> Timezone aware datetime objects are certainly useful, but NumPy's
> datetime64 was restricted to UTC. The consensus was that it was worse to
> have UTC-only rather than timezone-naive-only. NumPy's datetime64 is often
> used for data analysis purposes, for which automatic conversion to the
> local timezone of the computer running the analysis is often
> counter-productive.
>
> If you care about timezone conversions, I would highly recommend looking
> into pandas's Timestamp class for this purpose. In the future, this would
> be a good use-case for a new custom NumPy dtype. (The existing
> np.datetime64 code cannot easily handle multiple timezones.)
>
> On Thu, Nov 5, 2020 at 1:04 PM Eric Wieser 
> wrote:
>
>> Without weighing in yet on how I feel about the deprecation, you can see
>> some discussion about why this was originally deprecated in the PR that
>> introduced the warning:
>>
>> https://github.com/numpy/numpy/pull/6453
>>
>> Eric
>>
>> On Thu, Nov 5, 2020, 20:13 Noam Yorav-Raphael  wrote:
>>
>>> Hi,
>>>
>>> I sug

Re: [Numpy-discussion] Add sliding_window_view method to numpy

2020-11-06 Thread Ralf Gommers
On Fri, Nov 6, 2020 at 9:51 AM Zimmermann Klaus 
wrote:

> Hi all,
>
>
> I have absolutely no problem keeping this out of the main namespace.
>
> In fact I'd like to point out that it was not my idea. Rather, it was
> proposed by Bas van Beek in the comments [1,2] and received a little
> more scrutiny from Eric Wieser in [3].
>

Thanks, between two PRs with that many comments, I couldn't figure that out
- just saw the commit that make the change.


> The reason that it didn't receive the scrutiny it probably deserves is
> that it got a bit mangled up with the array dispatch discussion; sorry
> for that.
>

No worries at all. This is why we announce new features on the mailing list.


> On the subject matter, I am also curious about the potential for
> confusion. What other behavior could one expect from a sliding window
> view with this shape?
>
> As I said, I am completely fine with keeping this out of the main
> namespace, but I agree with Sebastian's comment, that
> `np.lib.stride_tricks` is perhaps not the best namespace.


I agree that that's not a great namespace. There's multiple issues with
namespaces, we basically have three good ones (fft, linalg, random) and a
bunch of other ones that range from questionable to terrible. See
https://github.com/numpy/numpy/blob/master/numpy/tests/test_public_api.py#L127
for details.

This would be a good thing to work on - making the `numpy.lib` namespace
not bleed into `numpy` via `import *` is one thing to do there, and there's
many others. But given backwards compat constraints it's not easy.


> The reason
> from my point of view is that stride tricks is really a technical (and
> slightly ominous) name that might throw of more application oriented
> programmers from finding and using this function. Thinking of my
> scientist colleagues, I think those are exactly the kind of users that
> could benefit from such a prototyping tool.
>

That phrasing is one of a number of concerns. NumPy is normally not in the
business of providing things that are okay as a prototyping tool, but are
potentially extremely slow (as pointed out in the Notes section of the
docstring). A function like that would basically not be the right tool for
almost anything in, e.g., SciPy - it requires an iterative algorithm. In
NumPy we don't prefer performance at all costs, but in general it's pretty
decent rather than "Numba or Cython may gain you 100x here".

Other issues include:
2) It is very specific to NumPy's memory model (as pointed out by you and
Sebastian) - just like the rest of stride_tricks
3) It has "view" in the name, which doesn't quite make sense for the main
namespace (also connected to point 2 above).
4) The cost of putting something in the main namespace for other
array/tensor libraries is large. Maybe other libraries, e.g. CuPy, Dask,
TensorFlow, PyTorch, JAX, MXNet, aim to reimplement part or all of the main
NumPy namespace as well as possible. This would trigger discussions and
likely many person-weeks of work for others.
5) It's a useful function, but it's very much on the margins of NumPy's
scope. It could easily have gone into, for example, scipy.signal. At this
point the bar for functions going into the main namespace should be (and
is) high.

All this taken together means it's not even a toss-up for me. If it were
just one or two of these points, maybe. But given all the above, I'm pretty
confident saying "it does not belong in the main namespace".

Cheers,
Ralf



>
> Cheers
> Klaus
>
>
>
> [1] https://github.com/numpy/numpy/pull/17394#issuecomment-700998618
> [2] https://github.com/numpy/numpy/pull/17394#discussion_r498215468
> [3] https://github.com/numpy/numpy/pull/17394#discussion_r498724340
>
> On 06/11/2020 01:39, Sebastian Berg wrote:
> > On Thu, 2020-11-05 at 17:35 -0600, Sebastian Berg wrote:
> >> On Thu, 2020-11-05 at 12:51 -0800, Stephan Hoyer wrote:
> >>> On Thu, Nov 5, 2020 at 11:16 AM Ralf Gommers <
> >>> ralf.gomm...@gmail.com>
> >>> wrote:
> >>>
>  On Thu, Nov 5, 2020 at 4:56 PM Sebastian Berg <
>  sebast...@sipsolutions.net>
>  wrote:
> 
> > Hi all,
> >
> > just a brief note that I merged this proposal:
> >
> > https://github.com/numpy/numpy/pull/17394
> >
> > adding `np.sliding_window_view` into the 1.20 release of NumPy.
> >
> > There was only one public API change, and that is that the
> > `shape`
> > argument is now called `window_shape`.
> >
> > This is still a good time for feedback in case you have a
> > better
> > idea
> > e.g. for the function or parameter names.
> >
> 
>  The old PR had this in the lib.stride_tricks namespace. Seeing it
>  in the
>  main namespace is unexpected and likely will lead to
>  issues/questions,
>  given that such an overlapping view is going to do behave in ways
>  the
>  average user will be surprised by. It may also lead to requests
>  for
>  other
>  array/tensor libraries to i

Re: [Numpy-discussion] datetime64: Remove deprecation warning when constructing with timezone

2020-11-06 Thread Brock Mendel
> I find the whole notion of a "timezone naive timestamp" to be nearly
meaningless

>From the perspective of, say, the dateutil parser, what would you do with
"2020-11-06 07:48"?  If you assume it's UTC you'll be wrong in this case.
If you assume it is in your local timezone, you'll be wrong in Europe.
Timezone-naive datetimes are an abstraction for exactly this case.

>>> t0 = pd.Timestamp.now()

You can use `pd.Timestamp.now("UTC")`.  See also
https://mail.python.org/archives/list/datetime-...@python.org/thread/PT4JWJLYBE5R2QASVBPZLHH37ULJQR43/
, https://github.com/pandas-dev/pandas/issues/22451





On Fri, Nov 6, 2020 at 2:48 AM Noam Yorav-Raphael 
wrote:

> Hi,
>
> I actually arrived at this by first trying to use pandas.Timestamp and
> getting very frustrated about it. With pandas, I get:
>
> >>> pd.Timestamp.now()
> Timestamp('2020-11-06 09:45:24.249851')
>
> I find the whole notion of a "timezone naive timestamp" to be nearly
> meaningless. A timestamp should mean a moment in time (as the current numpy
> documentation defines very well). A "naive timestamp" doesn't mean
> anything. It's exactly like a "unit naive length". I can have a Length type
> which just takes a number, and be very happy that it works both if my "unit
> zone" is inches or centimeters. So "Length(3)" will mean 3 cm in most of
> the world and 3 inches in the US. But then, if I get "Length(3)" from
> someone, I can't be sure what length it refers to.
>
> So currently, this happens with pandas timestamps:
>
> >>> os.environ['TZ'] = 'UTC'; time.tzset()
> ... t0 = pd.Timestamp.now()
> ... time.sleep(1)
> ... os.environ['TZ'] = 'EST-5'; time.tzset()
> ... t1 = pd.Timestamp.now()
> ... t1 - t0
> Timedelta('0 days 05:00:01.001583')
>
> This is not just theoretical - I actually need to work with data from
> several devices, each in its own time zone. And I need to know that I won't
> get such meaningless results.
>
> And you can even get something like this:
>
> >>> t0 = pd.Timestamp.now()
> ... time.sleep(10)
> ... t1 = pd.Timestamp.now()
> ... t1 - t0
> Timedelta('0 days 01:00:10.001583')
>
> if the first measurement happened to be in winter time and the second
> measurement happened to be in daylight saving time.
>
> The solution is simple, and is what datetime64 used to do before the
> change - have a type that just represents a moment in time. It's not "in
> UTC" - it just stores the number of seconds that passed since an agreed
> moment in time (which is usually 1970-01-01 02:00+0200, which is more
> commonly referred to as 1970-01-01 00:00Z - it's the exact same moment).
>
> I think it would make things clearer if I'll mention that there are
> operations that are not dealing with timestamps. For example, it's
> meaningless to ask what is the year of a timestamp - it may depend on the
> time zone. These are always *human* related questions, that depend on
> certain human conventions. We can call them "calendar questions". For these
> types of questions, a type that includes both a timestamp and a timezone
> offset (in minutes from UTC) can be useful. Some questions even require
> full timezone information, meaning a function that defines what's the
> timezone offset for each moment. However, I don't think numpy should deal
> with those calendar issues. As a very simple example, even for
> "timestamp+offset" types, it's not clear how to compare them - should
> values with the same timestamp and different offsets be considered equal or
> not? And in virtually all of my data analysis, this calendar aspect has
> nothing to do with the questions I'm trying to answer.
>
> I have a suggestion. Instead of changing datetime64 (which I consider to
> be ill-defined, but never mind), add a new type called "timestamp64". It
> will have the exact same behavior as datetime64 had before the change,
> except that its only allowed units will be seconds, milliseconds,
> microseconds and nanoseconds.  Removing the longer units will make it clear
> that it doesn't deal with calendar and dates. Also, all the business day
> functionality will not be applicable to timestamp64. In order to get
> calendar information (such as the year) from timestamp64, you will have to
> manually convert it to python's datetime (or to np.datetime64) with an
> explicit timezone (utc, local, an offset, or a timezone object).
>
> What do you think?
>
> Thanks,
> Noam
>
>
>
>
>
> On Fri, Nov 6, 2020 at 1:45 AM Stephan Hoyer  wrote:
>
>> I can try to dig up the old discussions, but datetime64 used to implement
>> both (1) and (3), and this was updated in a very intentional way.
>> Datetime64 now works like Python's own time-zone naive datetime.datetime
>> objects. The documentation referencing "Z" should be updated -- datetime64
>> can be in any timezone you like.
>>
>> Timezone aware datetime objects are certainly useful, but NumPy's
>> datetime64 was restricted to UTC. The consensus was that it was worse to
>> have UTC-only rather than timezone-naive-only. NumPy's datetime64

Re: [Numpy-discussion] Add sliding_window_view method to numpy

2020-11-06 Thread Zimmermann Klaus
Hi,

On 06/11/2020 15:58, Ralf Gommers wrote:
> On Fri, Nov 6, 2020 at 9:51 AM Zimmermann Klaus
> mailto:klaus.zimmerm...@smhi.se>> wrote:
> I have absolutely no problem keeping this out of the main namespace.
> 
> In fact I'd like to point out that it was not my idea. Rather, it was
> proposed by Bas van Beek in the comments [1,2] and received a little
> more scrutiny from Eric Wieser in [3].
> 
> Thanks, between two PRs with that many comments, I couldn't figure that
> out - just saw the commit that make the change.

Understandable, no worries.

> On the subject matter, I am also curious about the potential for
> confusion. What other behavior could one expect from a sliding window
> view with this shape?
> 
> As I said, I am completely fine with keeping this out of the main
> namespace, but I agree with Sebastian's comment, that
> `np.lib.stride_tricks` is perhaps not the best namespace. 
> 
> 
> I agree that that's not a great namespace. There's multiple issues with
> namespaces, we basically have three good ones (fft, linalg, random) and
> a bunch of other ones that range from questionable to terrible. See
> https://github.com/numpy/numpy/blob/master/numpy/tests/test_public_api.py#L127
> 
> for details.
> 
> This would be a good thing to work on - making the `numpy.lib` namespace
> not bleed into `numpy` via `import *` is one thing to do there, and
> there's many others. But given backwards compat constraints it's not easy.

I understand cleaning up all the namespaces is a giant task, so far, far
out of scope here. As said before, I also completely agree to keep it
out of the main namespace (though I will still argue below :P).

I was just wondering if, of the top your head, an existing, better fit
comes to mind?

> The reason
> from my point of view is that stride tricks is really a technical (and
> slightly ominous) name that might throw of more application oriented
> programmers from finding and using this function. Thinking of my
> scientist colleagues, I think those are exactly the kind of users that
> could benefit from such a prototyping tool.
> 
> 
> That phrasing is one of a number of concerns. NumPy is normally not in
> the business of providing things that are okay as a prototyping tool,
> but are potentially extremely slow (as pointed out in the Notes section
> of the docstring). A function like that would basically not be the right
> tool for almost anything in, e.g., SciPy - it requires an iterative
> algorithm. In NumPy we don't prefer performance at all costs, but in
> general it's pretty decent rather than "Numba or Cython may gain you
> 100x here".

I still think that the performance concern is a bit overblown. Yes,
application with large windows can need more FLOPs by an equally large
factor. But most such applications will use small to moderate windows.
Furthermore, this view focuses only on FLOPs. In my current field of
climate science (and many others), that is almost never the limiting
factor. Memory demands are far more problematic and incidentally, those
are more likely to increase in other methods that require the storage of
ancillary, temporary data.

> Other issues include:
> 2) It is very specific to NumPy's memory model (as pointed out by you
> and Sebastian) - just like the rest of stride_tricks
Not wrong, but on the other hand, that memory model is not exotic. C,
Fortran, and any number of other languages play very nicely with this,
just as important downstream libraries like dask.

> 3) It has "view" in the name, which doesn't quite make sense for the
> main namespace (also connected to point 2 above).
Ok.

> 4) The cost of putting something in the main namespace for other
> array/tensor libraries is large. Maybe other libraries, e.g. CuPy, Dask,
> TensorFlow, PyTorch, JAX, MXNet, aim to reimplement part or all of the
> main NumPy namespace as well as possible. This would trigger discussions
> and likely many person-weeks of work for others.
Agreed. Though I have to say that my whole motivation comes from
corresponding issues in dask that where specifically waiting for (the
older version of) this PR (see [1, 2,...]). But I understand that dask
is effectively much closer to the numpy memory model than, say, CuPy, so
don't take this to mean it should be in the main namespace.

> 5) It's a useful function, but it's very much on the margins of NumPy's
> scope. It could easily have gone into, for example, scipy.signal. At
> this point the bar for functions going into the main namespace should be> 
> (and is) high.
I agree that the bar for the main namespace should be high!

> All this taken together means it's not even a toss-up for me. If it were
> just one or two of these points, maybe. But given all the above, I'm
> pretty confident saying "it does not belong in the main namespace".
Again, I am happy with that.


Thanks for your