[Numpy-discussion] Re: Adding bfill() to numpy.
On Mon, May 20, 2024 at 1:17 PM george trojan wrote: > xarray has both bfill and ffill for DataArrays. The implementation uses > bottleneck https://github.com/pydata/bottleneck. > bottleneck would probably be a good home for these functions (though I don't speak for its maintainers). If it gets picked up by a bunch of other array implementations, then you can make a proposal to have them added to the Array API standard. numpy probably won't add them unless that happens first. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
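For readers landing on this thread: absent bottleneck, the usual pure-NumPy versions of these can be sketched as follows. The names mirror xarray's, NaN is assumed as the missing-value marker, and 1-D input is assumed for brevity:

```python
import numpy as np

def ffill(a):
    # Forward-fill: carry the last non-NaN value forward.
    a = np.asarray(a, dtype=float)
    mask = np.isnan(a)
    # Index of the most recent non-NaN position at each element.
    idx = np.where(~mask, np.arange(len(a)), 0)
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

def bfill(a):
    # Backward-fill is just a forward-fill on the reversed array.
    return ffill(np.asarray(a)[::-1])[::-1]
```

Leading NaNs survive `ffill` (and trailing NaNs survive `bfill`), matching the usual semantics.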
[Numpy-discussion] Re: dtype=object arrays not treated like Python list?
On Fri, Mar 29, 2024 at 9:11 AM Steven G. Johnson wrote: > Should a dtype=object array be treated more like Python lists for type > detection/coercion reasons? Currently, they are treated quite differently: > > >>> import numpy as np > >>> np.isfinite([1,2,3]) > array([ True, True, True]) > >>> np.isfinite(np.asarray([1,2,3], dtype=object)) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: ufunc 'isfinite' not supported for the input types, and the > inputs could not be safely coerced to any supported types according to the > casting rule ''safe'' > > The reason I ask is that we ran into something similar when trying to pass > wrappers around Julia arrays to Python. A Julia `Any[]` array is much like > a Python list or a Numpy `object` array, and exposed in Python as a subtype > of MutableSequence, but we found that it was treated by NumPy as more like > a Numpy `object` array than a Python list ( > https://github.com/JuliaPy/PythonCall.jl/issues/486). > > Would it be desirable to treat a 1d Numpy `object` array more like a > Python `list`? Or is there a way for externally defined types to opt-in > to the `list` behavior? (I couldn't figure out in the numpy source code > where `list` is being special-cased?) > `list` isn't special-cased, per se. Most numpy functions work on `ndarray` objects and accept "array-like" objects like `list`s by coercing them to `ndarray` objects ASAP using `np.asarray()`. `asarray` will leave existing `ndarray` instances alone. When you pass `[1,2,3]` to a numpy function, `np.asarray([1,2,3])` takes that list and infers the dtype and shape from its contents using just the `Sequence` API. Since they are `int` objects of reasonable size, the result is `np.array([1, 2, 3], dtype=np.int64)`. The `isfinite` ufunc has a loop defined for `dtype=np.int64`.
`np.asarray()` will also check for special methods and properties like `__array__()` and `__array_interface__` to allow objects to customize how they should be interpreted as `ndarray` objects, in particular, allowing memory to be shared efficiently as pointers, if the layouts are compatible, but also to control the dtype of the final array. PythonCall.jl implements `__array__()` and `__array_interface__` for its array objects for these purposes, and if I am not mistaken, explicitly makes `Any[]` convert to a `dtype=object` `ndarray` <https://github.com/JuliaPy/PythonCall.jl/blob/f586f2494432f2e5366ff1e2876b8aa532630b54/src/JlWrap/objectarray.jl#L61-L69> (the `typestr = "|O"` is equivalent to `dtype=object` in that context). It's *possible* that you'd really rather have this `PyObjectArray` *not* implement those interfaces and just let the `np.asarray()` inference work its magic through the `Sequence` API, but that is expensive. You could also try doing the type inference yourself and implement that in the `PyObjectArray.__array__()` implementation and avoid implementing `__array_interface__` for that object. Then `np.asarray()` will just delegate to `PyObjectArray.__array__()`. -- Robert Kern
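A hedged illustration of that last suggestion (the class name is hypothetical, not part of PythonCall.jl): a wrapper that implements `__array__()` but not `__array_interface__`, so `np.asarray()` delegates to it and dtype inference runs over the plain sequence instead of being forced to `object`:

```python
import numpy as np

class SeqWrapper:
    """Hypothetical wrapper: delegate dtype inference to np.asarray."""
    def __init__(self, items):
        self._items = list(items)

    def __array__(self, dtype=None):
        # No __array_interface__ defined, so np.asarray() calls this and
        # infers an appropriate dtype from the contents of the list.
        return np.asarray(self._items, dtype=dtype)

a = np.asarray(SeqWrapper([1, 2, 3]))
assert np.issubdtype(a.dtype, np.integer)  # inferred, not object
assert np.isfinite(a).all()                # the integer ufunc loop exists
```

As noted above, this inference pass walks every element, so it trades the `TypeError` for a real conversion cost on large arrays.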
[Numpy-discussion] Re: Polyfit error in displacement
On Tue, Mar 26, 2024 at 3:39 PM Luca Bertolotti wrote: > Thanks for your reply. Yes, a cubic spline seems more appropriate, but how > can I get what they call SC? > I don't think any of us has enough context to know what "SC" is. It's not a standard term that I'm aware of. -- Robert Kern
[Numpy-discussion] Re: Converting array to string
On Sun, Feb 25, 2024 at 1:52 PM Dom Grigonis wrote: > Thank you for your answer, > > Yeah, I was unsure if it ever existed in the first place. > > Space is less of an issue and something like `a.data.hex()` would be fine > as long as its speed was on par with `a.tobytes()`. However, it is 10x > slower on my machine. > > This e-mail is pretty much a final check (after a fair bit of research and > attempts) that it can not be done so I can eliminate this possibility as > feasible and concentrate on other options. > I think that mostly runs the gamut for pure JSON. Consider looking at BJData, but it's "JSON-based" and not quite pure JSON. https://neurojson.org/ -- Robert Kern
[Numpy-discussion] Re: Converting array to string
On Sat, Feb 24, 2024 at 7:17 PM Dom Grigonis wrote: > Hello, > > I am seeking a bit of help. > > I need a fast way to transfer numpy arrays in json format. > > Thus, converting array to list is not an option and I need something > similar to: > > a = np.ones(1000) > %timeit a.tobytes() > 17.6 ms > > This is quick, fast and portable. In other words I am very happy with > this, but... > > Json does not support bytes. > > Any method of subsequent conversion from bytes to string is a number of > times slower than the actual serialisation. > > So my question is: Is there any way to serialise directly to string? > > I remember there used to be 2 methods: tobytes and tostring. However, I > see that tostring is deprecated and its functionality is equivalent to > `tobytes`. Is it completely gone? Or is there a chance I can still find a > performant version of converting to and back from `str` type of non-human > readable form? > The old `tostring` was actually the same as `tobytes`. In Python 2, the `str` type was what `bytes` is now, a string of octets. In Python 3, `str` became a string of Unicode characters (what you want) and the `bytes` type was introduced for the string of octets, so `tostring` was merely _renamed_ to `tobytes` to match. `tostring` never returned a string of Unicode characters suitable for inclusion in JSON. AFAICT, there is not likely to be a much more efficient way to convert from an array to a reasonable JSONable encoding (e.g. base64). The time you are seeing is the time it takes to encode that amount of data, period. That said, if you want to use a quite inefficient hex encoding, `a.data.hex()` is somewhat faster than the base64 encoding, but it's less than ideal in terms of space usage. -- Robert Kern
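To make the base64 route concrete, a common pattern (a sketch of the trade-off discussed above, not an official recommendation) is to pair `tobytes()` with base64 and carry the dtype and shape alongside:

```python
import base64
import json
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Encode: raw bytes -> base64 text that JSON can carry.
payload = json.dumps({
    "dtype": str(a.dtype),
    "shape": list(a.shape),
    "data": base64.b64encode(a.tobytes()).decode("ascii"),
})

# Decode: reverse the steps.
d = json.loads(payload)
b = np.frombuffer(base64.b64decode(d["data"]),
                  dtype=d["dtype"]).reshape(d["shape"])
assert (a == b).all()
```

For cross-platform files you would want `a.dtype.str` (e.g. `'<f8'`) rather than `str(a.dtype)`, since the latter does not record byte order.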
[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays
On Thu, Feb 15, 2024 at 10:21 AM wrote: > Hello Numpy community, > > I'm proposing the introduction of a `pipe` method for NumPy arrays to > enhance their usability and expressiveness. > Adding a prominent method like this to `np.ndarray` is something that we will probably not take up ourselves unless it is adopted by the Array API standard <https://data-apis.org/array-api/latest/>. It's possible that you might get some interest there since the Array API deliberately strips out a number of the methods that we already have (e.g. `.mean()`, `.sum()`, etc.) in favor of functions. A general way to add some kind of fluency cheaply in an Array API-agnostic fashion might be helpful to people whose numpy-only code uses our current set of methods in this way. But you'll have to make the proposal to them, I think, to get started. -- Robert Kern
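For reference, the kind of fluency being proposed can be had today with a tiny array-agnostic helper function (a sketch of the idea, not an accepted API):

```python
import numpy as np

def pipe(x, *funcs):
    """Apply funcs left to right: pipe(x, f, g) == g(f(x))."""
    for f in funcs:
        x = f(x)
    return x

a = np.array([-3.0, 4.0])
# Instead of the inside-out np.sqrt(np.sum(np.square(a))):
norm = pipe(a, np.square, np.sum, np.sqrt)
assert norm == 5.0
```

Because it only calls functions, this works unchanged with any Array API-compatible library, which is roughly why a free function is an easier sell than a new `ndarray` method.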
[Numpy-discussion] Re: Feature request: Extension of the np.argsort Function - Returning Positional Information for Data
On Tue, Jan 16, 2024 at 11:05 PM hao chen wrote: > When dealing with lists that contain duplicate data, np.argsort fails to > return index values that correspond to the actual sorting positions of the > data, as it does when handling arrays without duplicates. > > Dear Author: > > When I use the np.argsort function on an array without duplicate data, the > returned index values correspond to the sorting positions of the respective > data. > > x = [1, 2, 5, 4] > rank = np.argsort(x) > print(rank) > # [0 1 3 2] > > That is not what `argsort` is intended or documented to do. It returns an array of indices _into `x`_ such that if you took the values from `x` in that order, you would get a sorted array. That is, if `x` were sorted into the array `sorted_x`, then `x[rank[i]] == sorted_x[i]` for all `i in range(len(x))`. The indices in `rank` are positions in `x`, not positions in `sorted_x`. They happen to correspond in this case, but that's a coincidence that's somewhat common in these small examples. But consider `[20, 30, 10, 40]`: >>> x = np.array([20, 30, 10, 40]) >>> ix = np.argsort(x) >>> def position(x): ... sorted_x = np.array(x) ... sorted_x.sort() ... return np.searchsorted(sorted_x, x) ... >>> ip = position(x) >>> ix array([2, 0, 1, 3]) >>> ip array([1, 2, 0, 3]) But also notice: >>> np.argsort(np.argsort(x)) array([1, 2, 0, 3]) This double-argsort is what you seem to be looking for, though it depends on what you want from the handling of duplicates (do you return the first index into the sorted array with the same value as in my `position()` implementation, or do you return the index that particular item was actually sorted to). Either way, we probably aren't going to add this as its own function. Both options are straightforward combinations of existing primitives. 
-- Robert Kern
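One addition to the double-argsort idiom above: the second sort can be replaced by a plain permutation inversion, which is O(n) once the first `argsort` is done. This assumes the "index the item actually sorted to" handling of duplicates:

```python
import numpy as np

x = np.array([20, 30, 10, 40])
ix = np.argsort(x)

# Invert the permutation: rank[i] is where x[i] landed in the sorted array.
rank = np.empty_like(ix)
rank[ix] = np.arange(len(x))

assert (rank == np.argsort(np.argsort(x))).all()
assert rank.tolist() == [1, 2, 0, 3]
```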
[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)
On Wed, Jan 3, 2024 at 4:09 PM Aaron Meurer wrote: > sign(z) = z/|z| is a fairly standard definition. See > https://oeis.org/wiki/Sign_function and > https://en.wikipedia.org/wiki/Sign_function. It's also implemented > this way in MATLAB and Mathematica (see > https://www.mathworks.com/help/symbolic/sign.html and > https://reference.wolfram.com/language/ref/Sign.html). The function > z/|z| is useful because it represents a normalization of z as a vector > in the complex plane onto the unit circle. > > With that being said, I'm not so sure about the suggestion about > extending copysign(x1, x2) as |x1|*sign(x2). I generally think of > copysign as a function to manipulate the floating-point representation > of a number. It literally copies the sign *bit* from x2 into x1. It's > useful because of things like -0.0, which is otherwise difficult to > work with since it compares equal to 0.0. I would find it surprising > for copysign to do a numeric calculation on complex numbers. Also, > your suggested definition would be wrong for 0.0 and -0.0, since > sign(0) is 0, and this is precisely where copysign matters. > Agreed on all points. -- Robert Kern
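For concreteness, the z/|z| definition under discussion (with sign(0) = 0) can be computed by hand today; this sketch does not assume any particular `np.sign` behavior, which differed across NumPy versions for complex input:

```python
import numpy as np

def complex_sign(z):
    """sign(z) = z/|z|, with sign(0) = 0 (the definition discussed above)."""
    z = np.asarray(z, dtype=complex)
    mag = np.abs(z)
    # Guard the 0/0 case; np.where selects 0 for the zero entries anyway.
    return np.where(mag == 0, 0, z / np.where(mag == 0, 1, mag))

s = complex_sign([3 + 4j, -2j, 0])
assert np.allclose(s, [0.6 + 0.8j, -1j, 0])
assert np.allclose(np.abs(s[:2]), 1)  # nonzero results lie on the unit circle
```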
[Numpy-discussion] Re: savetxt() has fmt='%.18e' as default, but fmt='%.16e' is always sufficient
On Sat, Nov 25, 2023 at 11:18 AM Jeppe Dakin wrote: > Double-precision numbers need at most 17 significant decimal digits to be > serialised losslessly. Yet, savetxt() uses 19 by default, meaning that most > files produced with savetxt() take up about 9% more disk space than they > need to, without any benefit. I have described the problem in more detail on > Stack Overflow: > > https://stackoverflow.com/questions/77535380/minimum-number-of-digits-for-exact-double-precision-and-the-fmt-18e-of-numpy > > Is there any reason behind the default choice of savetxt(..., > fmt='%.18e')? If not, why not reduce it to savetxt(..., fmt='%.16e')? > A long time ago when `savetxt()` was written, we did not use the reliable Dragon4 string representation algorithm that guarantees that floating point numbers are written out with the minimum number of decimal digits needed to reproduce the number. We may even have relied on the platform's floating point to string conversion routines, which were of variable quality. The extra digits accounted for that unreliability. It probably could be changed now, but I'd want more aggressive testing of the assertion of correctness (`random()`, as used in that StackOverflow demonstration, does *not* exercise a lot of the important edge cases in the floating point format). But if your true concern is that 9% of disk space, you probably don't want to be using `savetxt()` in any case. -- Robert Kern
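The 17-significant-digit claim ('%.16e' prints one digit before the decimal point plus 16 after) can be spot-checked along these lines, with the caveat from the reply above that random draws plus a few hand-picked constants are nowhere near an exhaustive test of the edge cases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random values plus some awkward constants.
x = np.concatenate([rng.random(1000),
                    [0.1, 1 / 3, np.pi, np.nextafter(1.0, 2.0)]])

for fmt in ("%.16e", "%.17e", "%.18e"):
    # Format each value as text, parse it back, and demand exact equality.
    y = np.array([float(fmt % v) for v in x])
    assert (x == y).all()
```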
[Numpy-discussion] Re: pickling dtype values
On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet wrote: > hi there, > > I have written a Go package[1] that can read/write simple arrays in the > numpy file format [2]. > when I wrote it, it was for simple interoperability use cases, but now > people would like to be able to read back ragged-arrays[3]. > > unless I am mistaken, this means I need to interpret pieces of pickled > data (`ndarray`, `multiarray` and `dtype`). > > so I am trying to understand how to unpickle `dtype` values that have been > pickled: > > ```python > import numpy as np > import pickle > import pickletools as pt > > pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True) > ``` > > gives: > ``` > 0: \x80 PROTO 4 Protocol version indicator. > 2: \x95 FRAME 55 Indicate the beginning of a new frame. >11: \x8c SHORT_BINUNICODE 'numpy' Push a Python Unicode string object. >18: \x94 MEMOIZE(as 0)Store the stack top into the memo. > The stack is not popped. >19: \x8c SHORT_BINUNICODE 'dtype' Push a Python Unicode string object. >26: \x94 MEMOIZE(as 1)Store the stack top into the memo. > The stack is not popped. >27: \x93 STACK_GLOBAL Push a global object (module.attr) on > the stack. >28: \x94 MEMOIZE(as 2)Store the stack top into the memo. > The stack is not popped. >29: \x8c SHORT_BINUNICODE 'i4'Push a Python Unicode string object. >33: \x94 MEMOIZE(as 3)Store the stack top into the memo. > The stack is not popped. >34: \x89 NEWFALSE Push False onto the stack. >35: \x88 NEWTRUE Push True onto the stack. >36: \x87 TUPLE3 Build a three-tuple out of the top > three items on the stack. >37: \x94 MEMOIZE(as 4)Store the stack top into the memo. > The stack is not popped. >38: RREDUCE Push an object built from a callable > and an argument tuple. >39: \x94 MEMOIZE(as 5)Store the stack top into the memo. > The stack is not popped. >40: (MARK Push markobject onto the stack. >41: KBININT13 Push a one-byte unsigned integer. >43: \x8c SHORT_BINUNICODE '<' Push a Python Unicode string object. 
>46: \x94 MEMOIZE(as 6)Store the stack top into the memo. > The stack is not popped. >47: NNONE Push None on the stack. >48: NNONE Push None on the stack. >49: NNONE Push None on the stack. >50: JBININT -1Push a four-byte signed integer. >55: JBININT -1Push a four-byte signed integer. >60: KBININT10 Push a one-byte unsigned integer. >62: tTUPLE (MARK at 40) Build a tuple out of the topmost > stack slice, after markobject. >63: \x94 MEMOIZE(as 7) Store the stack top into the > memo. The stack is not popped. >64: bBUILD Finish building an object, via > __setstate__ or dict update. >65: .STOPStop the unpickling machine. > highest protocol among opcodes = 4 > ``` > > I have tried to find the usual `__reduce__` and `__setstate__` methods to > understand what are the various arguments, to no avail. > First, be sure to read the generic `object.__reduce__` docs: https://docs.python.org/3.11/library/pickle.html#object.__reduce__ Here is the C source for `np.dtype.__reduce__()`: https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750 And `np.dtype.__setstate__()`: https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151 so, in : > ```python > >>> np.dtype("int32").__reduce__()[1] > ('i4', False, True) > These are arguments to the `np.dtype` constructor and are documented in `np.dtype.__doc__`. The `False, True` arguments are hardcoded and always those values. > >>> np.dtype("int32").__reduce__()[2] > (3, '<', None, None, None, -1, -1, 0) > These are arguments to pass to `np.dtype.__setstate__()` after the object has been created. 0. `3` is the version number of the state; `3` is typical for simple dtypes; datetimes and others with metadata will bump this to `4` and use a 9-element tuple instead of this 8-element tuple. 1. `'<'` is the endianness flag. 2. If there are subarrays <https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g. `np.dtype((np.int32, (2,2)))`), that info here. 3. 
If there are fields, a tuple of the names of the fields 4. If there are fields, the
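Putting the pieces above together, the whole pickle round-trip can be reproduced by hand, which may help when reimplementing it in Go; the `('i4', False, True)` args and the 8-element state tuple are taken from the example in this thread:

```python
import numpy as np

dt = np.dtype("int32")
ctor, args, state = dt.__reduce__()

# args are the constructor arguments shown above.
assert args == ("i4", False, True)
assert state[0] == 3      # state version for simple dtypes
assert state[1] == "<"    # endianness flag

dt2 = ctor(*args)         # what the pickle REDUCE opcode does
dt2.__setstate__(state)   # what the pickle BUILD opcode does
assert dt2 == dt
```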
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer wrote: > rng.integers() (or np.random.randint) lets you specify lists for low > and high. So you can just use rng.integers((0,)*len(dims), dims). > > Although I'm not seeing how to use this to generate a bunch of vectors > at once. I would have thought something like size=(10, dims) would let > you generate 10 vectors of length dims but it doesn't seem to work. > `size=(k, len(dims))`

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        dsize = size - len(seen)
        seen.update(map(tuple, rng.integers(0, ashape, size=(dsize, len(shape)))))
    return list(seen)

That optimistic optimization makes this the fastest solution. -- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt wrote: > On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote: > > Attached is a script that implements this solution. > > > And the version with set duplicates checking. > If you're going to do the set-checking yourself, then you don't need the unbounded integer support. This is somewhat slower, probably due to the greedy tuple conversions, but eliminates all of the complexity of subclassing `Random`:

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        idx = tuple(rng.integers(0, ashape))
        seen.add(idx)
    return list(seen)

Unfortunately, subclassing from `Random` still doesn't get you the ability to sample without replacement for arbitrary-sized populations using `Random`'s own methods for that. `Random.sample(population, k)` requires `population` to be a `Sequence`, and that restricts you to index-sized integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it gets a too-large integer from `__len__`). But `Random.sample` is only just doing set-checking anyways, so no loss. -- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer wrote: > On Fri, Nov 17, 2023 at 12:10 PM Robert Kern > wrote: > > > > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 > > Wouldn't it be better to just use random.randint to generate the > vectors directly at that point? If the number of possibilities is more > than 2**64, birthday odds of generating the same vector twice are on > the order of 1 in 2**32. And you can always do a unique() rejection > check if you really want to be careful. > Yes, I jumped to correcting the misreading of the docstring rather than solving the root problem. Almost certainly, the most straightforward strategy is to use `rng.integers(0, array_shape)` repeatedly, storing tuples of the resulting indices in a set and rejecting duplicates. You'd have to do the same thing if one used `rng.integers(0, np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the flattened case. `rng.integers()` doesn't sample without replacement in any case. `rng.choice()` does. It also will not support unbounded integers; it uses a signed `int64` to hold the population size, so it is bounded to that. That is _likely_ to be a fine upper bound on the size of the array (if it is a real array in a 64-bit address space). So one could use `rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and unflatten these integers to the index tuples if that suits the array size. 
It's just doing the same set lookups internally, just somewhat more efficiently. -- Robert Kern
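A sketch of that last approach: draw distinct flat indices with `rng.choice(..., replace=False)` and unflatten them back into index tuples with `np.unravel_index`:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (5, 7, 3)
k = 10

# Sample k distinct flat indices, then unflatten to N-d indices.
flat = rng.choice(np.prod(shape), size=k, replace=False)
idx = np.stack(np.unravel_index(flat, shape), axis=-1)

assert idx.shape == (k, len(shape))
assert len({tuple(row) for row in idx}) == k   # all rows unique
assert (idx >= 0).all() and (idx < np.array(shape)).all()
```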
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion < numpy-discussion@python.org> wrote: > Hi all, > > I am trying to sample k N-dimensional vectors from a uniform distribution > without replacement. > It seems like this should be straightforward, but I can't seem to pin it > down. > > Specifically, I am trying to get random indices in an d0 x d1 x d2.. x > dN-1 array. > > I thought about sneaking in a structured dtype into `rng.integers`, but of > course that doesn't work. > > If we had a string sampler, I could sample k unique words (consisting of > digits), and convert them to indices. > > I could over-sample and filter out the non-unique indices. Or iteratively > draw blocks of samples until I've built up my k unique indices. > > The most straightforward solution would be to flatten indices, and to > sample from those. The integers get large quickly, though. The rng.integers > docstring suggests that it can handle object arrays for very large integers: > > > When using broadcasting with uint64 dtypes, the maximum value (2**64) > > cannot be represented as a standard integer type. > > The high array (or low if high is None) must have object dtype, e.g., > array([2**64]). > > But, that doesn't work: > > In [35]: rng.integers(np.array([2**64], dtype=object)) > ValueError: high is out of bounds for int64 > > Is there an elegant way to handle this problem? > The default dtype for the result of `integers()` is the signed `int64`. If you want to sample from the range `[0, 2**64)`, you need to specify `dtype=np.uint64`. The text you are reading is saying that if you want to specify exactly `2**64` as the exclusive upper bound that you won't be able to do it with a `np.uint64` array/scalar because it's one above the bound for that dtype, so you'll have to use a plain Python `int` object or `dtype=object` array in order to represent `2**64`. It is not saying that you can draw arbitrary-sized integers. 
>>> rng.integers(2**64, dtype=np.uint64) 11569248114014186612 If the arrays you are drawing indices for are real in-memory arrays for present-day 64-bit computers, this should be adequate. If it's a notional array that is larger, then you'll need actual arbitrary-sized integer sampling. The builtin `random.randrange()` will do arbitrary-sized integers and is quite reasonable for this task. If you want it to use our BitGenerators underneath for clean PRNG state management, this is quite doable with a simple subclass of `random.Random`: https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
On Wed, Oct 11, 2023 at 11:50 PM Aaron Meurer wrote: > Is there a way to make pickle not depend on the specific submodule > that a class is defined in? No. `Unpickler` somehow has to locate the class/reconstruction function. It will have to use the name given by the `__reduce_ex__` during pickling. > Wouldn't this happen again if you ever > decided to rename _core. > Yes. > The underscores in numpy._core._reconstruct don't actually do anything > here in terms of making the interface not public, and if anything, are > really misleading. > There's private and there's private. There are two broad things that could mean: 1. We don't want users playing around with it, importing it directly in their code and using it. 2. Marking that we won't mess around with it without going through the deprecation policy. Usually, these two go together (hiding it from users directly calling it means that we get to mess around with it freely), but when we get to metaprogramming tasks like pickling, they become a little decoupled. Because the "user" that we have to think about isn't a person reading documentation and browsing APIs, but a file that someone created years ago and some infrastructure that interprets that file. I think it's good that `_reconstruct` is hidden from human users and is distinct from almost all of what constitutes "numpy's public API", but that doesn't mean that we can't have a tiny third class of functions where we do preserve some guarantees. It's only occasionally confusing to us core devs, not users (who can rely on the "don't touch underscored names" rule just fine). I'm also curious about this more generally. We tend to think of the > fully qualified name of a class as being an implementation detail for > many libraries and only the top-level lib.Name should be used, but > many things in the language (including this) break this. > Yes, and it's why these things attempt to be scoped in ways that limit this problem. I.e. 
if you use pickling, you're told to use it only for transient data with the same versions of libraries on both ends of the pipe, but the reality is that it's too useful to avoid in creating files with arbitrarily long lives. Not their fault; they warned us! -- Robert Kern
[Numpy-discussion] Re: Find location of slice in it's base
On Thu, Aug 31, 2023 at 3:25 PM Dom Grigonis wrote: > Hi everyone, > > I am working with shared arrays and their slices. And trying to subclass > ndarray in such way so that they remap to memory on unpickling. I am using > package SharedArray, which doesn’t micro-manage memory locations, but > rather stores the whole map via shm using unique name. One issue I ran into > is that when I pickle & unpickle slice of memory mapped array, I need to > store the spec of the subset so that I can retrieve the same portion. > > The issue I am working on has been briefly discussed on stack: > https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array. > Although I am not sure if it is the exact same one. > > So, in short, is there a way to determine location of a slice in original > array. Ideally, it would be indexing which works for any non-copy > slice/subset possible. > If you chase the `.base` chain all the way down, you get to a `SharedArray.map_owner` object with its `.addr` attribute giving the memory address that's assigned to it in the current process. This will be the same address that's in the `'data'` key of that first `ndarray` returned from `SharedArray.create()` that you are making your views from. Your view `ndarray` may be a view of views, with slices, transposes, reshapes, and arbitrary `stride_tricks` manipulations in between. If you store the difference between the view `ndarray`'s `'data'` address and the `map_owner.addr` address, I think that's all you need to recreate the array in a different process which gets assigned a different memory address for the same `shm` name. Just add that offset to the `map_owner.addr` and restore the rest of the information in `__array_interface__`, and I think you should be good to go. I don't think you'll need to infer or represent the precise path of Python-level operations (slices, transposes, reshapes, etc.) by which it got to that point.
-- Robert Kern
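The bookkeeping can be demonstrated with plain in-process views, substituting an ordinary `ndarray` base for the `map_owner` object (`view_spec` is a hypothetical helper for illustration, not part of SharedArray):

```python
import numpy as np

def view_spec(view, base):
    """Hypothetical: record a view's layout relative to its base allocation."""
    base_addr = base.__array_interface__["data"][0]
    addr = view.__array_interface__["data"][0]
    return {
        "offset": addr - base_addr,   # bytes past the base address
        "shape": view.shape,
        "strides": view.strides,
        "typestr": view.dtype.str,
    }

a = np.zeros((4, 6))     # stands in for the shared base array
v = a[1:, ::2]           # an arbitrary non-copy view
spec = view_spec(v, a)

assert spec["offset"] == 6 * 8        # skipped one row of 6 float64s
assert spec["strides"] == (48, 16)    # row stride unchanged, column doubled
```

In another process, adding `spec["offset"]` to the new base address and feeding the shape/strides/typestr back through the array interface (e.g. `np.lib.stride_tricks.as_strided` on a suitably offset buffer) reconstructs the same view.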
[Numpy-discussion] Re: Inquiry about the difference between numpy.trunc() and numpy.fix()
On Wed, Aug 23, 2023 at 10:06 PM Bao Guo wrote: > I am writing to seek clarification on the specific differences between the > numpy.trunc() and numpy.fix() functions. > > I have come across opinions stating that these two functions exist for > backward compatibility and that they serve different functional > requirements. However, I am puzzled as to why these functions could not be > merged or one of them upgraded to encompass both functionalities. Despite > my efforts in researching and contemplating on this matter, I have not come > across a convincing explanation. > > Could anyone kindly shed some light on this matter? I am seeking a deeper > understanding of the design choices made for these functions and the > rationale behind their coexistence. > > Thank you in advance for your time and assistance. `fix()` was added to numpy back in the super-early days when it was emerging from Numeric, its earlier predecessor. It was named after the equivalent MATLAB function. At some point thereafter, we settled on a policy for inclusion of mathematical ufuncs into numpy. We got tired of arguing whether a special function was "important enough" to be in numpy instead of scipy.special. Eventually, we settled on a bright line that if a math function was in the C99 standard, we'd go ahead and allow a ufunc for it in numpy. `trunc()` is the name of that equivalent function in the C99 standard. It is a ufunc instead of a plain function like `fix()` and has a (strict, I believe) superset of functionality. You can ignore `fix()`, more or less. I'm not sure if it's on the list for deprecation/removal in numpy 2.0, but it certainly could be. -- Robert Kern
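The practical difference is visible in a few lines: the numerical results agree, but only `trunc` is a true ufunc with the full ufunc machinery:

```python
import numpy as np

x = np.array([-1.7, -0.2, 0.2, 1.7])

# Both round toward zero and give identical values.
assert (np.trunc(x) == np.fix(x)).all()

# Only trunc is a ufunc, so it carries out=, where=, .reduce, etc.
assert isinstance(np.trunc, np.ufunc)
assert not isinstance(np.fix, np.ufunc)
```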
[Numpy-discussion] Re: Arbitrarily large random integers
On Sat, Aug 19, 2023 at 10:49 AM Kevin Sheppard wrote: > The easiest way to do this would be to write a pure python implementation > using Python ints of a masked integer sampler. This way you could draw > unsigned integers and then treat this as a bit pool. You would then take > the number of bits needed for your integer, transform these to be a Python > int, and finally apply the mask. > Indeed, that's how `random.Random` does it. I've commented on the issue with an implementation that subclasses `random.Random` to use numpy PRNGs as the source of bits for maximum compatibility with `Random`. The given use case motivating this feature request is networkx, which manually wraps numpy PRNGs in a class that incompletely mimics the `Random` interface. A true subclass eliminates all of the remaining inconsistencies between the two. I'm inclined to leave it at that and not extend the `Generator` interface. https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
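For illustration, a minimal subclass along those lines might look like the following. The class name and the bit-assembly details here are my own sketch, not the implementation posted on the linked issue; `getstate`/`setstate` handling is omitted.

```python
import random

import numpy as np


class NumpyRandom(random.Random):
    """Sketch: a random.Random subclass that draws its bits from a
    NumPy Generator, so Random's methods (randrange on arbitrarily
    large ints, shuffle, choice, ...) work unchanged."""

    def seed(self, a=None, version=2):
        # random.Random.__init__ calls this, so no __init__ is needed.
        self._rng = np.random.default_rng(a)

    def random(self):
        # A float in [0, 1), as Random requires.
        return float(self._rng.random())

    def getrandbits(self, k):
        # Assemble k bits from 32-bit draws; Random.randrange() builds
        # arbitrarily large Python ints on top of this method.
        out = 0
        for word in self._rng.integers(0, 1 << 32,
                                       size=(k + 31) // 32, dtype=np.uint64):
            out = (out << 32) | int(word)
        return out >> ((-k) % 32)
```

Because `getrandbits()` is defined, `random.Random` routes `randrange()` through it, so draws well beyond 64 bits work, and two instances seeded identically produce the same stream.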
[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
On Fri, Aug 11, 2023 at 1:47 PM Benjamin Root wrote: > I'm really confused. Summing from zero should be what cumsum() does now. > > ``` > >>> np.__version__ > '1.22.4' > >>> np.cumsum([[1, 2, 3], [4, 5, 6]]) > array([ 1, 3, 6, 10, 15, 21]) > ``` > which matches your example in the cumsum0() documentation. Did something > change in a recent release? > That's not what's in his example. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
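For readers following the proposal: the difference is that `cumsum0()` prepends the zero, so the result has one more element than the input. A rough sketch of that behavior (the name comes from the proposal; the body here is illustrative):

```python
import numpy as np

def cumsum0(a):
    # Cumulative sum over the flattened input, with 0 prepended,
    # so the result has one more element than the input.
    a = np.asarray(a)
    return np.concatenate([np.zeros(1, dtype=a.dtype), np.cumsum(a)])

cumsum0([[1, 2, 3], [4, 5, 6]])  # array([ 0,  1,  3,  6, 10, 15, 21])
```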
[Numpy-discussion] Re: np.dot yields a different result when computed in two pieces
Accelerated BLAS operations are accelerated precisely by taking these opportunities to rearrange the computations, not just (or primarily) by parallelism. They are very finely tuned kernels that use the fine details of the CPU/FPU to pipeline instructions (which might be SIMD) and optimize memory movement and register use. In particular, SIMD instructions might be used, on your machine, to deal with rows in batches of 4, giving the pattern that you are seeing. I, personally, am not willing to give up those optimizations for a slightly more predictable order of additions. On Tue, Jul 25, 2023 at 8:04 AM Junyan Xu wrote: > Hello everyone, > > I asked this question > https://stackoverflow.com/questions/76707696/np-dot-yields-a-different-result-when-computed-in-two-pieces > on StackOverflow a few days ago but still haven't got an answer. Long story > short, I discovered (in certain common settings like Google Colab) that the > method NumPy uses to compute the dot product of a 4D vector with a 4 x m > matrix depends on m, so that if you cut a 4 x 2m matrix into two 4 x m > matrices and compute the dot products with the vector individually, you get > a slightly different result. More precisely, I found that: > > For m = 1, the order ((0+1)+2)+3 is used; > for m = 2 and 3, the order (0+1)+(2+3) is used for all entries; > for 4 ≤ m ≤ 7, unknown method is used for the first 4 entries, and the > order ((0+1)+2)+3 is used for the rest; > for 8 ≤ m ≤ 11, unknown method is used for the first 8 entries, and the > order ((0+1)+2)+3 is used for the rest; > for 12 ≤ m ≤ 15, unknown method is used for the first 12 entries, and the > order ((0+1)+2)+3 is used for the rest; > and the pattern probably continues. > > If we instead take the dot product of a m x 4 matrix with a 4D vector > (rowMode = True in my code): > > For m = 1, the order ((0+1)+2)+3 is used; > for m ≥ 2, the order (0+2)+(1+3) is used for all entries (verified up to m > = 17). 
> > Moreover, if you replace 4 by an arbitrary n, the difference between the > results grows, super-linearly in n in the vec-dot-mat case, but > sub-linearly in the mat-dot-vec case. > > I think this behavior is highly non-intuitive and probably unnecessary: we > are simply repeating the same operation m times (possibly in parallel), why > should the chosen method depend on how many times the operation is > repeated? Therefore, I'd like to propose an enhancement item to make the > result independent of m. I realize this may need to be done upstream, so as > a first step we may need to find out which BLAS library functions etc. are > called to compute dot products and which are responsible for this behavior. > > Notebook for figuring out order of addition: > https://colab.research.google.com/drive/1O4NtTWMiZbSxKFVg-lfBJDVypIBdVJ22?usp=sharing > Notebook for growth with n: > https://colab.research.google.com/drive/10ecRWvcMEXY0RchVlaLyPUFHJBA4KwA6?usp=sharing > > Thank you in advance for your thoughts on this. > > Best regards, > > Junyan > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
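The underlying point is that float addition is not associative, so any reordering by the kernel can shift results by a few ULPs. A hedged illustration (whether the two computations actually differ depends on the BLAS build in use, so only the size of the difference is bounded here):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(4)
M = rng.standard_normal((4, 16))

full = v @ M
split = np.concatenate([v @ M[:, :8], v @ M[:, 8:]])

# The kernel may sum the four products in a different order for the
# two shapes, so full and split can disagree in the last few bits,
# but never by more than a handful of ULPs.
diff = np.max(np.abs(full - split))
```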
[Numpy-discussion] Re: mixed mode arithmetic
On Tue, Jul 11, 2023 at 10:11 AM Matti Picus wrote: > > On 10/7/23 16:13, Jens Glaser via NumPy-Discussion wrote: > > Hi Matti, > > > > The documentation for numpy.dot currently states > > > > """ > > out > > ndarray, optional > > Output argument. This must have the exact kind that would be returned if > it was not used. In particular, it must have the right type, must be > C-contiguous, and its dtype must be the dtype that would be returned for > dot(a,b). This is a performance feature. Therefore, if these conditions are > not met, an exception is raised, instead of attempting to be flexible. > > """ > > > > I think this means that if dot(a,b) returned FP32 for FP16 inputs, it > would be consistent with this API to supply a full precision output array. > All that would be needed in an actual implementation is a mixed_precision > flag (or output_dtype option) for this op to override the usual type > promotion rules. Do you agree? > > > > Jens > > `np.dot` is strange. Could you use `np.matmul` instead, which is a real > ufunc and (I think) already does this? > Sort of. As currently implemented, no, it won't do what Jens wants because there is no `ee->f` loop implemented (`e` is the typecode for float16, `f` for float32). Passing float16 operands and requesting a float32 output (either with `dtype=np.float32` or providing such an `out=`) will fall down to the `ff->f` loop and cause upcasting of the operands, which is not what they want. But notionally one could add an `ee->f` loop between those two that would catch this case when `dtype=np.float32` is requested. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
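To make those typecodes concrete: requesting a float32 result from float16 operands already works today, but via upcasting into the `ff->f` loop rather than through a dedicated `ee->f` loop. A small check:

```python
import numpy as np

# 'e' and 'f' are the typecodes discussed above.
assert np.dtype(np.float16).char == 'e'
assert np.dtype(np.float32).char == 'f'

a = np.ones((2, 2), dtype=np.float16)
b = np.ones((2, 2), dtype=np.float16)

# No dtype requested: the ee->e loop runs, float16 out.
assert np.matmul(a, b).dtype == np.float16
# dtype=float32 requested: operands are upcast and ff->f runs,
# which is not the accumulate-in-float32 behavior Jens wants.
assert np.matmul(a, b, dtype=np.float32).dtype == np.float32
```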
[Numpy-discussion] Re: mixed mode arithmetic
On Mon, Jul 10, 2023 at 1:49 AM Matti Picus wrote: > On 9/7/23 23:34, glaserj--- via NumPy-Discussion wrote: > > > Reviving this old thread - I note that numpy.dot supports in-place > > computation for performance reasons like this > > c = np.empty_like(a, order='C') > > np.dot(a, b, out=c) > > > > However, the data type of the pre-allocated c array must match the > result datatype of a times b. Now, with some accelerator hardware (i.e. > tensor cores or matrix multiplication engines in GPUs), mixed precision > arithmetic with relaxed floating point precision (i.e., not > necessarily IEEE 754 conformant) but faster performance is possible, > which could be supported in downstream libraries such as cupy. > > > > Case in point, a mixed precision calculation may take half precision > inputs, but accumulate in and return full precision outputs. Due to the > above mentioned type consistency, the outputs would be unnecessarily > demoted (truncated) to half precision again. The current API of numpy does > not expose mixed precision concepts. Therefore, it would be nice if it was > possible to build in support for hardware accelerated linear algebra, even > if that may not be available on the standard (CPU) platforms numpy is > typically compiled for. > > > > I'd be happy to flesh out some API concepts, but would be curious to > first get an opinion from others. It may be necessary to weigh the > complexity of adding such support explicitly against providing minimal > hooks for add-on libraries in the style of JMP (for jax.numpy), or AMP (for > torch). > > > > Jens > > If your goal is "accumulate in and return full precision outputs", then > you can allocate C as the full precision type, and NumPy should do the > right thing. Note it may convert the entire input array to the final > dtype, rather than doing it "on the fly" which could be expensive in > terms of memory.
> In this case, no, `np.dot()` will raise an exception if it's not the dtype that `np.dot(a, b)` would naturally produce. It's different from the ufuncs, which do indeed behave like you describe. `np.dot()` implements its own dtype coercion logic. Jens, there's nothing that really prevents adding mixed precision operations. ufuncs let you provide loops with mixed dtypes, mostly to support special functions that take both integer and real arguments (or real and complex). But one could conceivably put in mixed-precision loops. For non-ufuncs like `np.dot()` that implement their own dtype-coercion logic, it's just a matter of coding it up. The main reasons that we don't are that it's a lot of work for possibly marginal gain to put in all of the relevant permutations and that it complicates an already overly-complicated and not-fully-coherent type promotion scheme. `np.dot()` is kind of an oddball already, and "half-precision inputs -> full-precision outputs" might be a worthwhile use case given hardware accelerators. Given that this largely affects non-numpy implementations of the Array API, you probably want to raise it with that group. numpy can implement that logic if the Array API requires it. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
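The strictness of `np.dot()`'s `out=` handling is easy to observe; a minimal sketch (the exact exception type and message may vary across versions, so the example only checks that it raises):

```python
import numpy as np

a = np.ones((2, 2), dtype=np.float16)
b = np.ones((2, 2), dtype=np.float16)

# dot(a, b) would naturally produce float16, so a float32 out= is
# rejected rather than adapted to; unlike a ufunc, dot() does not
# upcast or downcast to satisfy the output array.
out = np.empty((2, 2), dtype=np.float32)
try:
    np.dot(a, b, out=out)
    raised = False
except (ValueError, TypeError):
    raised = True
```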
[Numpy-discussion] Re: Create a method to index N-dim tensors using 1D index #23992
On Tue, Jun 20, 2023 at 11:38 AM Daniel Salles Civitarese < sall...@br.ibm.com> wrote: > ### Proposed new feature or change: > > I work with geospatial data that requires tensors with many dimensions. > One challenge I used to have was when I had to implement a `__getitem__` to > access those tensors using a 1D index. One use case is machine learning, > when one needs to feed models sequentially with sub-tensors of the original > one. A simple example is a matrix of shape 2x2 that has the positions `(0, > 0)`, `(0, 1)`, `(1, 0)`, `(1, 1)`, and I want to access it via indices `[0, > 1, 2, 3]`. So, when I say `__getitem__(2)`, I want the position `(1, 0)`. > If the only reason you want to compute the indices is to access the data using those flattened indices, you can use the `.flat` property instead and avoid explicitly computing those indices. ``` [~] |21> x = np.arange(2*3).reshape((2, 3)) [~] |22> x array([[0, 1, 2], [3, 4, 5]]) [~] |23> x.flat[[0, 1, 2, 3]] array([0, 1, 2, 3]) ``` -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
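Conversely, if the N-d position for a given flat index is genuinely needed (the `__getitem__(2)` -> `(1, 0)` example), `np.unravel_index` does that conversion without building any index table:

```python
import numpy as np

# Flat index 2 in a 2x2 array corresponds to position (1, 0),
# matching the example in the question.
assert np.unravel_index(2, (2, 2)) == (1, 0)

# And the inverse direction:
assert int(np.ravel_multi_index((1, 0), (2, 2))) == 2
```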
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 5:51 PM Benjamin Root wrote: > I think it is the special values aspect that is most concerning. Math is > just littered with all sorts of identities, especially with trig functions. > While I know that floating point calculations are imprecise, there are > certain properties of these functions that still held, such as going from > -1 to 1. > > As a reference point on an M1 Mac using conda-forge: > ``` > >>> import numpy as np > >>> np.__version__ > '1.24.3' > >>> np.sin(0.0) > 0.0 > >>> np.cos(0.0) > 1.0 > >>> np.sin(np.pi) > 1.2246467991473532e-16 > >>> np.cos(np.pi) > -1.0 > >>> np.sin(2*np.pi) > -2.4492935982947064e-16 > >>> np.cos(2*np.pi) > 1.0 > ``` > > Not perfect, but still right in most places. > FWIW, those ~0 answers are actually closer to the correct answers than 0 would be because `np.pi` is not actually π. Those aren't problems in the implementations of np.sin/np.cos, just the intrinsic limits of floating point representation and the choice of radians, which places particular special values between adjacent representable floating point numbers. > I'm ambivalent about reverting. I know I would love speed improvements > because transformation calculations in GIS is slow using numpy, but also > some coordinate transformations might break because of these changes. > Good to know. Do you have any concrete example that might be worth taking a look at in more detail? Either for performance or accuracy. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
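To spell that out: `np.pi` is the double just below the true π, and sin is locally linear with slope -1 near π, so the correctly rounded `sin(np.pi)` is the small positive gap between `np.pi` and π, not 0. A quick check (only bounds are asserted, since the last bits depend on the libm in use):

```python
import math
import numpy as np

# sin(np.pi) should be a tiny positive number (~1.22e-16); returning
# exactly 0.0 here would actually be *less* accurate, because the
# argument is np.pi, not π.
assert 0.0 < np.sin(np.pi) < 1e-15

# sin of an exact zero is exactly zero in any reasonable libm.
assert np.sin(0.0) == 0.0

# The C library agrees on the sign and scale of sin(pi).
assert 0.0 < math.sin(math.pi) < 1e-15
```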
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 5:01 PM David Menéndez Hurtado < davidmen...@gmail.com> wrote: > On Wed, 31 May 2023, 22:41 Robert Kern, wrote: > >> Not sure anyone really uses tanh for serious work. >> > > At the risk of derailing the discussion, the case I can think of (but kind > of niche) is using neural networks to approximate differential equations. > Then you need non linearities in the gradients everywhere. > Fair. I actually meant to delete that sentence during the last editing pass. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 12:37 PM Devulapalli, Raghuveer < raghuveer.devulapa...@intel.com> wrote: > I wouldn’t discount the performance impact on real world benchmarks > for these functions. Just to name a couple of examples: > > > >- 7x speed up of np.exp and np.log results in a 2x speed up of >training neural networks like logistic regression [1]. I would expect >np.tanh will show similar results for neural networks. >- Vectorizing even simple functions like np.maximum results in a 1.3x >speed up of sklearn’s Kmeans algorithm [2] > > > > Raghuveer > > > > [1] https://github.com/numpy/numpy/pull/13134 > > [2] https://github.com/numpy/numpy/pull/14867 > Perfect, those are precisely the concrete use cases I would want to see so we can talk about the actual ramifications of the changes. These particular examples suggest to me that a module or package providing fast-inaccurate functions would be a good idea, but not across-the-board fast-inaccurate implementations (though it's worth noting that the exp/log/maximum replacements that you cite don't seem to be particularly inaccurate). The performance improvements show up in situational use cases. Logistic regression is not really a neural network (unless you squint real hard), so the loss function does account for a significant fraction of whole-program runtime; the activation and loss functions of real neural networks take up a rather small amount of time compared to the matmuls. Nonetheless, people do optimize activation functions, but often by avoiding special functions entirely with ReLUs (which have other benefits in terms of nice gradients). Not sure anyone really uses tanh for serious work. ML is a perfect use case for *opt-in* fast-inaccurate implementations. The whole endeavor is to replace complicated computing logic with a smaller number of primitives that you can optimize the hell out of, and let the model size and training data size handle the complications.
And a few careful choices by people implementing the marquee packages can have a large effect. In the case of transcendental activation functions in NNs, if you really want to optimize them, it's a good idea to trade *a lot* of accuracy (measured in %, not ULPs) for performance, in addition to doing it on GPUs. And that makes changes to the `np.*` implementations mostly irrelevant for them, and you can get that performance without making anyone else pay for it. Does anyone have compelling concrete use cases for accelerated trig functions, per se, rather than exp/log and friends? I'm more on board with accelerating those than trig functions because of their role in ML and statistics (I'd still *prefer* to opt in, though). They don't have many special values (which usually have alternates like expm1 and log1p to get better precision in any case). But for trig functions, I'm much more likely to be doing geometry where I'm with Archimedes: do not disturb my circles! -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 10:40 AM Ralf Gommers wrote: > > > On Wed, May 31, 2023 at 4:19 PM Charles R Harris < > charlesr.har...@gmail.com> wrote: > >> >> >> On Wed, May 31, 2023 at 8:05 AM Robert Kern >> wrote: >> >>> I would much, much rather have the special functions in the `np.*` >>> namespace be more accurate than fast on all platforms. These would not >>> have been on my list for general purpose speed optimization. How much time >>> is actually spent inside sin/cos even in a trig-heavy numpy program? And >>> most numpy programs aren't trig-heavy, but the precision cost would be paid >>> and noticeable even for those programs. I would want fast-and-inaccurate >>> functions to be strictly opt-in for those times that they really paid off. >>> Probably by providing them in their own module or package rather than a >>> runtime switch, because it's probably only a *part* of my program that >>> needs that kind of speed and can afford that precision loss while there >>> will be other parts that need the precision. >>> >>> >> I think that would be a good policy going forward. >> > > There's a little more to it than "precise and slow good", "fast == less > accurate == bad". We've touched on this when SVML got merged (e.g., [1]) > and with other SIMD code, e.g. in the "Floating point precision > expectations in NumPy" thread [2]. Even libm doesn't guarantee the best > possible result of <0.5 ULP max error, and there are also considerations > like whether any numerical errors are normally distributed around the exact > mathematical answer or not (see, e.g., [3]). > If we had a portable, low-maintenance, high-accuracy library that we could vendorize (and its performance cost wasn't *horrid*), I might even advocate that we do that. Reliance on platform libms is mostly about our maintenance burden than a principled accuracy/performance tradeoff. 
My preference *is* definitely firmly on the "precise and slow good" for these ufuncs because of the role these ufuncs play in real numpy programs; performance has a limited, situational effect while accuracy can have substantial ones across the board. It seems fairly clear that with this recent change, the feeling is that the > tradeoff is bad and that too much accuracy was lost, for not enough > real-world gain. However, we now had several years worth of performance > work with few complaints about accuracy issues. > Except that we get a flurry of complaints now that they actually affect popular platforms. I'm not sure I'd read much into a lack of complaints before that. > So I wouldn't throw out the baby with the bath water now and say that we > always want the best accuracy only. It seems to me like we need a better > methodology for evaluating changes. Contributors have been pretty careful, > but looking back at SIMD PRs, there were usually detailed benchmarks but > not always detailed accuracy impact evaluations. > I've only seen micro-benchmarks testing the runtime of individual functions, but maybe I haven't paid close enough attention. Have there been any benchmarks on real(ish) *programs* that demonstrate what utility these provide in even optimistic scenarios? I care precisely <1ULP about the absolute performance of `np.sin()` on its own. There are definitely programs that would care about that; I'm not sure any of them are (or should be) written in Python, though. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
I would much, much rather have the special functions in the `np.*` namespace be more accurate than fast on all platforms. These would not have been on my list for general purpose speed optimization. How much time is actually spent inside sin/cos even in a trig-heavy numpy program? And most numpy programs aren't trig-heavy, but the precision cost would be paid and noticeable even for those programs. I would want fast-and-inaccurate functions to be strictly opt-in for those times that they really paid off. Probably by providing them in their own module or package rather than a runtime switch, because it's probably only a *part* of my program that needs that kind of speed and can afford that precision loss while there will be other parts that need the precision. On Wed, May 31, 2023 at 1:59 AM Sebastian Berg wrote: > Hi all, > > there was recently a PR to NumPy to improve the performance of sin/cos > on most platforms (on my laptop it seems to be about 5x on simple > inputs). > This changes the error bounds on platforms that were not previously > accelerated (most users): > > https://github.com/numpy/numpy/pull/23399 > > The new error is <4 ULP, similar to what it was before, but previously > that error applied only on high-end Intel CPUs, which most users would not have noticed. > And unfortunately, it is a bit unclear whether this is too disruptive > or not. > > The main surprise is probably that the range of both does not include 1 > (and -1) exactly with this, and quite a lot of downstream packages > noticed this and needed test adaptations. > > Now, most of these are harmless: users shouldn't expect exact results > from floating point math and test tolerances need adjustment. OTOH, > sin/cos are practically 1/-1 on a wide range of inputs (they are > basically constant) so it is surprising that they deviate from it and > never reach 1/-1 exactly. > > Quite a few downstream libs noticed this, and NumPy users cannot > explicitly opt in to a different performance/precision trade-off.
The > question is whether it would be better to revert for now > and hope for a better implementation later. > > I doubt we can decide on a very clear-cut yes/no, but I am very > interested in whether everyone thinks this precision trade-off is too > surprising to users. > > Cheers, > > Sebastian > > > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Proposal: "intrange" for inclusive integer intervals
> [...] inclusive floating point ranges by scaling: > > intrange(10, 15, bound="closed") / 10 > [1.0, 1.1, 1.2, 1.3, 1.4, 1.5] > > This way, we don't have to worry about floating point imprecision screwing > up the bounds calculation. > > Sketch implementation: > > def intrange(start, stop, step=1, bound="open", anchor="start"): > match (anchor, bound): > case ("start", "open"): > return np.arange(start, stop, step) > case ("start", "closed"): > return np.arange(start, stop + step, step) > case ("start", "exact"): > result = np.arange(start, stop + step, step) > result[-1] = stop > return result > case ("stop", "open"): > return np.flip(np.arange(stop, start, -step)) > case ("stop", "closed"): > return np.flip(np.arange(stop, start - step, -step)) > case ("stop", "exact"): > result = np.flip(np.arange(stop, start - step, -step)) > result[0] = start > return result > case _: > assert False > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
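The scaling example from the quoted proposal can be checked by pulling out just the ("start", "closed") branch (a minimal restatement of the sketch above, not an existing numpy API):

```python
import numpy as np

def intrange_closed(start, stop, step=1):
    # The ("start", "closed") branch of the quoted sketch: an integer
    # range that includes the stop value.
    return np.arange(start, stop + step, step)

# Scaling integer endpoints keeps float imprecision out of the bounds
# calculation itself:
assert (intrange_closed(10, 15) / 10).tolist() == [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
```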
[Numpy-discussion] Re: Managing integer overflow
On Tue, May 2, 2023 at 8:03 AM wrote: > > I think this example shows that you don't need any special infrastructure > > from numpy. I don't think there is going to be much appetite to expand > our > > API in this direction. > > But I do! I'm looking for something that implements the > multiply_check_ov(X,Y) and similar functionality for addition and > subtraction, for large arrays; I don't know how to do that efficiently. (Or > even correctly for 64-bit operands --- for 8/16/32 I suppose as a baseline > I could do the math in 64-bit arithmetic and if the answer doesn't match, > then use the sign bit of the 64-bit result for determining overflow.) > Some C compilers provide some builtin functions that do this for you, and some cross-platform headers around that show how to do this. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html https://github.com/pytorch/pytorch/blob/main/c10/util/safe_numerics.h https://github.com/dcleblanc/SafeInt -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
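For the pure-NumPy baseline described in the question (do the math in 64-bit and compare), a sketch for dtypes narrower than 64 bits might look like this; `multiply_check_ov` is the poster's name for it, and the body is illustrative rather than a library function:

```python
import numpy as np

def multiply_check_ov(x, y):
    # Multiply in int64, then flag entries whose exact product falls
    # outside the original dtype's range: +1 for positive overflow,
    # -1 for negative overflow, 0 otherwise. Only valid for integer
    # dtypes narrower than 64 bits.
    x = np.asarray(x)
    y = np.asarray(y)
    wide = x.astype(np.int64) * y.astype(np.int64)
    info = np.iinfo(x.dtype)
    ov = np.zeros(wide.shape, dtype=np.int8)
    ov[wide > info.max] = 1
    ov[wide < info.min] = -1
    return wide.astype(x.dtype), ov  # narrowed result wraps on overflow
```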
[Numpy-discussion] Re: Managing integer overflow
Two boolean arrays are not more compact than one signed `int8` array. But since you'd have to make bool masks from such an array anyways to do anything useful, you might as well pass the bool masks. I think this example shows that you don't need any special infrastructure from numpy. I don't think there is going to be much appetite to expand our API in this direction. On Thu, Apr 27, 2023 at 9:13 AM wrote: > > Ideally if there is an overflow in an array operation I'd like to > produce an "overflow array" that's the same size as the result array, but > with the values +1 if it's a positive overflow or -1 if it's a negative > overflow. > > Alternatively, if boolean arrays are a much more compact memory footprint, > then return two overflow arrays, one positive and one negative. > > Use case: > > X = ... something ... > Y = ... something else ... > Z, ovpos, ovneg = multiply_check_ov(X,Y) > if ovpos.any(): >np.iinfo(Z.dtype).max > if ovneg.any(): ># uh oh, something we didn't expect, so raise an error > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Giving deprecation of e.g. `float(np.array([1]))` a shot (not 0-d)
On Thu, Apr 20, 2023 at 12:39 PM Evgeni Burovski wrote: > If symmetry w.r.t. pytorch is any guide, it was nice to have it: > > In [38]: float(torch.as_tensor([2])) > Out[38]: 2.0 > > In [39]: float(np.asarray([2])) > Out[39]: 2.0 > My question would be: Did they have a positive use case for this behavior, or were they just reflecting NumPy's behavior? AFAICR, the main reasoning on our side was that there was an unambiguous value that we _could_ return, so we might as well. And in our later experience, it was more trouble than it was worth. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: broadcasting question
On Wed, Mar 22, 2023 at 9:34 AM Neal Becker wrote: > I have a function F > def F(a, b): > c = a * b > > Initially, a is a scalar, b[240,3000]. No problem. > Later I want to use F, where a[240] is a vector. I want to allow both the > scalar and vector cases. So I write: > > def F(a,b): > a = np.atleast_1d(a) > c = a[:,None] * b > > This now works for scalar a or vector a. But this solution seems > inelegant, and somewhat fragile. Suppose later we want to allow > a[240,3000], a 2d array matching b. > > Certainly don't want to write code like: > if a.ndim == 0:... > > Is there a more elegant/robust approach? > I would leave it as `c = a * b` and simply record in the docstring that `a` and `b` should be broadcastable. Yes, that means that the user will have to write `F(a[:, np.newaxis], b)` for that one case, and that looks a little ugly, but overall it's less cognitive load on the user to just reuse the common convention of broadcasting than to record the special case. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
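Concretely (with shapes shrunk from the 240x3000 in the question), the plain `a * b` version covers all three cases once the caller supplies the axis:

```python
import numpy as np

def F(a, b):
    # a and b only need to be broadcastable against each other.
    return a * b

b = np.ones((4, 5))

assert F(2.0, b).shape == (4, 5)                    # scalar a
a = np.arange(4.0)
assert F(a[:, np.newaxis], b).shape == (4, 5)       # per-row vector a
assert F(np.ones((4, 5)), b).shape == (4, 5)        # full 2-d a
```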
[Numpy-discussion] Re: Dear Robert
On Sat, Feb 25, 2023 at 8:17 PM Louis Petingi wrote: > Hi Robert > > Just a follow up. I was able (my student) to get the 1 vector from the 0 > eigenvector. Even though the normalized or this set of eigenvectors will > work we could try the two sets. Not sure if multiplying the unit vectors by > a scalar is the correct approach. > Yeah, that essentially amounts to multiplying all of the eigenvectors by `np.sqrt(n)`. I don't imagine it will do much to change the results of K-means or another clustering algorithm unless it has absolute distance thresholds embedded in it. But it's absolutely a free choice for you to precondition things for your clustering algorithm. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
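That freedom is easy to verify: the columns returned by `np.linalg.eig` are unit-normalized, and rescaling them by `np.sqrt(n)` (or anything else) still satisfies the eigenvalue equation. A small example on a 2x2 symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
w, v = np.linalg.eig(A)

n = A.shape[0]
scaled = v * np.sqrt(n)  # rescale every eigenvector by sqrt(n)

# A @ x = lambda * x still holds column by column after rescaling:
assert np.allclose(A @ scaled, scaled * w)
# ...and the columns LAPACK returned were unit-norm to begin with.
assert np.allclose(np.linalg.norm(v, axis=0), 1.0)
```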
[Numpy-discussion] Re: Dear Robert
On Sat, Feb 25, 2023 at 5:33 PM Louis Petingi wrote:
> As you mentioned, this is a generalization of the Fiedler eigenvector. When applying spectral clustering and you want to find the two clusters, then the Fiedler eigenvector tells you how to partition the vertices (bipartition) so the normalized cut is minimized. The concept can be generalized to k clusters by applying K-means to the first k eigenvectors. The two partitions can be determined by grouping the vertices corresponding to the negative values of the eigenvector into one cluster, and the other cluster is the vertices corresponding to the non-negative values.
>
> I can perhaps use the normalized eigenvectors, but I am not sure if this is correct; thus, I prefer using the magnitudes. In theory it may work, as for example the entries of the Fiedler eigenvector are divided by the norm.

My understanding of this family of techniques is that the K-means step is basically heuristic (see section 8.4 of [1]). You have a lot of choices here, and playing around with the magnitudes might be among them. But there is nothing that comes _from_ the eigenvector calculation that can help determine what this should be. There's nothing special about the way that `np.linalg.eig()` computes eigenvectors before normalization that will improve the K-means step. Rather, you have the freedom to scale the eigenvectors freely to help out the K-means calculation (or any other clustering that you might want to do there). I suspect that the unit-normalization is probably one of the better ones.

[1] http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf

> The eigenvalue and eigenvector calculators (on the internet) find the magnitude of the vector.

No, they do not find "the magnitude" because there is no such thing. They just give _an_ arbitrary magnitude that happened to be the raw output of their _particular_ algorithm without the following unit-normalization step.
There are lots of algorithms to compute eigenvectors. The magnitudes of the eigenvector solutions before normalization are not all the same for all of the different algorithms. There is absolutely no guarantee (and very little probability) that if we modified `np.linalg.eig()` to not apply normalization that the unnormalized eigenvectors would match the unnormalized eigenvectors of any other implementation.

-- Robert Kern
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 4:11 PM Louis Petingi wrote:
> Hi Thanks
>
> Very simply, one of the solutions for the zero eigenvalue is the 1 eigenvector. If I get back this 1 vector for the 0 eigenvalue, then the other eigenvectors will be in the right format I am looking for. Once again, the 1 vector is the normalized eigenvector * norm.

There are multiple rules that could get this result. You could multiply all of the eigenvectors by `sqrt(n)`. Alternately, you could multiply each eigenvector `v` by `1/v[np.nonzero(v)[0][0]]` (i.e. the first nonzero value in the eigenvector). Both would give you the desired result for the 0-eigenvalue eigenvector, but different results for every other eigenvector. Both are simple and coherent rules, but quite likely neither are what you are looking for. Remember, each eigenvector can be scaled independently of any other, so establishing the desired result for one eigenvector does nothing to constrain any other. If you want help finding the rule that will help you in your research, you'll have to let us know what you are going to use the magnitudes _for_. The information you've given so far (where the eigenvectors come _from_) doesn't actually help narrow anything down. The only application of eigenvector magnitudes of graph Laplacians that I am aware of is the Fiedler vector, and that actually requires unit eigenvectors.

-- Robert Kern
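For concreteness, the two candidate rescaling rules can be sketched like this (a hypothetical illustration on a small path-graph Laplacian, not code from the thread; `np.linalg.eigh()` is used because the Laplacian is symmetric):

```python
import numpy as np

# Laplacian (degree matrix minus adjacency matrix) of a path graph on 4 vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Columns of v are unit-norm eigenvectors, in ascending order of eigenvalue,
# so column 0 belongs to the 0 eigenvalue.
w, v = np.linalg.eigh(L)
n = L.shape[0]

# Rule 1: multiply every eigenvector by sqrt(n).
v_sqrt_n = v * np.sqrt(n)

# Rule 2: divide each eigenvector by its first nonzero entry
# (fragile in floating point; shown only to illustrate the point).
first_nonzero = np.array([col[np.nonzero(col)[0][0]] for col in v.T])
v_first = v / first_nonzero
```

Both rules turn the 0-eigenvalue eigenvector into the all-ones vector, yet they disagree on every other eigenvector, which is the point of the reply.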
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 2:11 PM Louis Petingi wrote:
> Thank you for the reply. I am working with the Laplacian matrix of a graph, which is the Degree matrix minus the adjacency matrix. The Laplacian is a symmetric matrix and the smallest eigenvalue is zero. As the rows add up to 0, Lx=0x, and 1 is the resulting vector. The normalized eigenvector is the 1 vector divided by the norm. So if I have 10 vertices of the graph, the normalized eigenvector is 1/sqrt(10). I do understand that a scale * normalized eigenvector is also a solution, but for the purpose of my research I need the normalized eigenvector * norm.

I apologize, but I'm going to harp on this: you are using the word "the" as if there is one unique magnitude that we could report. There is none. If you have a use for a specific convention for the magnitudes, you'll have to point us to the literature that talks about it, and we might be able to give you pointers as to how to get the magnitudes that your research requires.

> For the 0 eigenvalue the norm of the eigenvector is easy to figure out but not for the other eigenvalues.
>
> That is what I meant by the original eigenvector, and sorry for the confusion. Most eigenvalue/eigenvector calculators will give you 1 for the first eigenvector.

I'm afraid that doesn't really narrow anything down for us.

-- Robert Kern
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 11:39 AM wrote:
> Dear all,
>
> I am not an expert in NumPy, but my undergraduate student is having some issues with the way Numpy returns the normalized eigenvectors corresponding to the eigenvalues. We do understand that an eigenvector is divided by the norm to get the unit eigenvectors; however, we do need the original vectors for the purpose of my research. This has been a really frustrating experience as NumPy returns the normalized vectors by default. I appreciate any suggestions of how to go about this issue. This seems to be an outstanding issue for people using Numpy.

I'm not sure what you mean by "the original vectors". All multiples of the unit eigenvector are eigenvectors. None have a claim on being "the original vector". Do you have a reference for what you are referring to? It's possible that there are specific procedures that happen to spit out vectors that are eigenvectors but have semantics about the magnitude, but `np.linalg.eig()` does not implement that procedure. The semantics about the magnitude would be supplied by that specific procedure.

-- Robert Kern
[Numpy-discussion] Re: Advanced indexing doesn't follow the Principle of least astonishment
On Thu, Dec 29, 2022 at 8:50 AM Diogo Valada wrote:
> Hi all,
>
> New to the mailing list, so I hope I'm creating a discussion in the right place.
>
> Am I the only one that thinks that advanced indexing in numpy doesn't follow the principle of minimum astonishment?
>
> for example
>
> ```python
> a = np.random.rand(100, 100)
>
> a[(2,4)]     # this yields the element at [2,4]
> a[[2,4]]     # this yields the rows at position 2 and 4
> a[1, (2,4)]  # this yields the 2nd and 4th elements of row 1. (So actually does advanced indexing)
> a[1, [2,4]]  # Works the same way as the previous one.
> ```
>
> Worst of all, it's very easy for someone to make a mistake and not notice it: it seems to me that the first method, a[(2,4)], should not be allowed, and instead only a[*(2,4)] should work. I checked how it works in Julia (which has a similar syntax), and a[(2,4)] would yield an error, which makes sense to me. Could it be an idea to deprecate a[(2,4)]-like usages?

No, that's not possible. In Python syntax, the comma `,` is what creates the tuple, not the parentheses. So `a[(2,4)]` is exactly `a[2, 4]`. `a[2, 4]` translates to `a.__getitem__((2, 4))`. So there's no way for the array to know whether it got `a[2, 4]` or `a[(2, 4)]`.

-- Robert Kern
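A quick demonstration of the equivalence (illustrative, with a small array rather than the poster's 100x100 one):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# The comma creates the tuple; the parentheses are redundant.
assert a[2, 3] == a[(2, 3)]

# Both spellings make the exact same call under the hood:
assert a[2, 3] == a.__getitem__((2, 3))

# A *list* (or array) of indices is what triggers advanced indexing:
rows = a[[0, 2]]   # rows 0 and 2, shape (2, 4)
```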
[Numpy-discussion] Re: How will Google DeepMind's AlphaTensor effect numpy
On Wed, Oct 26, 2022 at 9:31 AM wrote:
> Hello!
>
> I was curious how AlphaTensor will affect NumPy and other similar applications, considering it has found a way to perform 3x3 matrix multiplication efficiently. https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor. I am not even sure how NumPy does this under the hood, is it 2x2?
>
> Is anyone working on implementing this 3x3 algorithm for NumPy? Is it too early, and if so why? Are there any concerns about this algorithm?

numpy links against accelerated linear algebra libraries like OpenBLAS and Intel MKL to provide the matrix multiplication. If they find that the AlphaTensor results are better than the options they currently have, then numpy will get them. In general, I think it is unlikely that they will be used. Even the older state of the art that they compare with, like Strassen's algorithm, is not often used in practice. Concerns like memory movement and the ability to use instruction-level parallelism on each kind of CPU tend to dominate over a marginal change in the number of multiplication operations. The answers to this StackOverflow question give some more information: https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance

-- Robert Kern
[Numpy-discussion] Re: An extension of the .npy file format
On Thu, Aug 25, 2022 at 3:47 PM Qianqian Fang wrote: > On 8/25/22 12:25, Robert Kern wrote: > > I don't quite know what this means. My installed version of `jq`, for > example, doesn't seem to know what to do with these files. > > ❯ jq --version > jq-1.6 > > ❯ jq . eye5chunk_bjd_raw.jdb > parse error: Invalid numeric literal at line 1, column 38 > > > the .jdb files are binary JSON files (specifically BJData) that jq does > not currently support; to save as text-based JSON, you change the suffix to > .json or .jdt - it results in ~33% increase compared to the binary due to > base64 > > Okay. Given your wording, it looked like you were claiming that the binary JSON was supported by the whole ecosystem. Rather, it seems like you can either get binary encoding OR the ecosystem support, but not both at the same time. > I think a fundamental problem here is that it looks like each element in > the array is delimited. I.e. a `float64` value starts with b'D' then the 8 > IEEE-754 bytes representing the number. When we're talking about > memory-mappability, we are talking about having the on-disk representation > being exactly what it looks like in-memory, all of the IEEE-754 floats > contiguous with each other, so we can use the `np.memmap` `ndarray` > subclass to represent the on-disk data as a first-class array object. This > spec lets us mmap the binary JSON file and manipulate its contents in-place > efficiently, but that's not what is being asked for here. > > there are several BJData-compliant forms to store the same binary array > losslessly. The most memory efficient and disk-mmapable (but not > necessarily disk-efficient) form is to use the ND-array container syntax > <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type> > that BJData spec extended over UBJSON. > > Are any of them supported by a Python BJData implementation? 
I didn't see any option to get that done in the `bjdata` package you recommended, for example. https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200

-- Robert Kern
[Numpy-discussion] Re: An extension of the .npy file format
ment in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.

> 2. UBJSON/BJData natively support append-able root-level records; JSON has been extensively used in data streaming with appendable nd-json or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
>
> just a quick comparison of output file sizes with a 1000x1000 unitary diagonal matrix
>
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000)                 # create a large array
> y = np.vsplit(x, 5)              # split into smaller chunks
> np.save('eye5chunk.npy', y)      # save npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'})   # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'})   # lzma-compressed bjd
> newy = jd.load('eye5chunk_bjd_zlib.jdb')   # loading/decoding
> newx = np.concatenate(newy)                # regroup chunks
> newx.dtype
>
> here are the output file sizes in bytes:
>
> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb

Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
> 10338  eye5chunk_bjd_zlib.jdb
>  2206  eye5chunk_bjd_lzma.jdb
>
> Qianqian

-- Robert Kern
[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced
On Thu, Aug 25, 2022 at 4:27 AM Bill Ross wrote:
> Thanks, np.lib.format.open_memmap() works great! With prediction procs using minimal sys memory, I can get twice as many on GPU, with fewer optimization warnings.
>
> Why even have the number of records in the header? Shouldn't record size plus system-reported/growable file size be enough?

Only in the happy case where there is no corruption. Implicitness is not a virtue in the use cases that the format was designed for. There is an additional use case where the length is unknown a priori where implicitness would help, but the format was not designed for that case (and I'm not sure I want to add that use case).

> I'd love to have a shared-mem analog for smaller-scale data; now I load data and fork to emulate that effect.

There are a number of ways to do that, including using memmap on files on a memory-backed filesystem like /dev/shm/ on Linux. See this article for several more options: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2

> My file sizes will exceed memory, so I'm hoping to get the most out of memmap. Will this in-loop assignment to predsum work to avoid loading all to memory?
>
> predsum = np.lib.format.open_memmap(outfile, mode='w+',
>                                     shape=(ids_sq,), dtype=np.float32)
>
> for i in range(len(IN_FILES)):
>     pred = numpy.lib.format.open_memmap(IN_FILES[i])
>     predsum = np.add(predsum, pred)  # <-

This will replace the `predsum` array with a new in-memory array the first time through this loop. Use `out=predsum` to make sure that the output goes into the memory-mapped array:

    np.add(predsum, pred, out=predsum)

Or the usual augmented assignment:

    predsum += pred

> del pred
> del predsum

The precise memory behavior will depend on your OS's virtual memory configuration.
But in general, `np.add()` will go through the arrays in order, causing the virtual memory system to page in memory pages as they are accessed for reading or writing, and page out the old ones to make room for the new pages. Linux, in my experience, isn't always the best at managing that backlog of old pages, especially if you have multiple processes doing similar kinds of things (in the past, I have seen *each* of those processes trying to use *all* of the main memory for their backlog of old pages), but there are configuration tweaks that you can make.

-- Robert Kern
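Putting the `out=` fix together, here is a self-contained sketch with small temporary files (the file names and sizes are made up for illustration; the real files would be too large to hold in memory, which is the point of the memmap):

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()
in_files = [os.path.join(tmp, "pred%d.npy" % i) for i in range(3)]

# Three small input arrays standing in for the real prediction files.
for i, fn in enumerate(in_files):
    np.save(fn, np.full(5, float(i), dtype=np.float32))

outfile = os.path.join(tmp, "predsum.npy")
predsum = np.lib.format.open_memmap(outfile, mode="w+",
                                    shape=(5,), dtype=np.float32)
predsum[:] = 0          # be explicit; don't rely on zero-filled pages

for fn in in_files:
    pred = np.lib.format.open_memmap(fn, mode="r")
    np.add(predsum, pred, out=predsum)   # accumulate into the mapped array
    del pred
del predsum

result = np.load(outfile)
```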
[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced
On Tue, Aug 23, 2022 at 8:47 PM wrote:
> I want to calc multiple ndarrays at once and lack memory, so want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing rubric and close the file.
>
> Is it as simple as lib.format.write_array_header_2_0(fp, d) then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?

`item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There is no footer after the data. The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', dtype=dtype, shape=shape)`, then assign slices sequentially to the returned memory-mapped array. A memory-mapped array is usually going to be friendlier to whatever memory limits you are running into than a nominally "in-memory" array.

-- Robert Kern
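A hedged sketch of that basic scheme (the dtype, shape, and chunk size here are arbitrary choices for illustration):

```python
import os
import tempfile
import numpy as np

n_total, chunk = 10, 5
fn = os.path.join(tempfile.mkdtemp(), "serial.npy")

with open(fn, "wb") as fp:
    # Write the header once, with the final shape known up front.
    header = {"descr": "<f4", "fortran_order": False, "shape": (n_total,)}
    np.lib.format.write_array_header_2_0(fp, header)
    # Then write the data cyclically; there is no footer after the data.
    for start in range(0, n_total, chunk):
        item = np.arange(start, start + chunk, dtype="<f4")
        item.tofile(fp)

roundtrip = np.load(fn)
```

The result is an ordinary `.npy` file readable with `np.load()`.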
[Numpy-discussion] Re: generate a random sample based on independent priors in `random.choice` #22082
As discussed on the corresponding Github issue[1], this is not going to be functionality that we add onto `choice()`, nor is it likely something that we will add as a separate `Generator` method. The desired computation is straightforward to do by composing existing functionality.

[1] https://github.com/numpy/numpy/issues/22082

On Thu, Aug 4, 2022 at 2:07 PM Rodo-Singh wrote:
> Proposed new feature or change:
>
> Objective: Sample elements from the given iterator (a list, a numpy array, etc.) based upon pre-defined probabilities associated with each element, which may not sum up to 1.
>
> Overview
> * In the numpy.random.choice function, in the cases where we're explicitly providing the list of probabilistic values for an input sample, the requirement expects the sum of the whole list to be 1. This makes sense when all the elements are possible observations for a single random variable whose pmf (probability mass function) is nothing but the p list.
> * But when every element in that list can be regarded as an observation of a separate independent Bernoulli r.v. (random variable), then the problem falls into the category of multi-label. Intuitively speaking, for each element in the list or array given to us, we'll toss a coin (may be biased or unbiased) ~B(p) (i.e., follows Bernoulli with p as success probability for head).
> * The output array or list would probably be a proper subset, or can be the full list, or can be an empty one.
> * Plus, here the argument size should automatically get shut down as we just want to know which elements got selected into the output list after n coin tosses (where len(list) = n). Also it may happen that each element is independent as argued above, but sum(p) = 1. Then we can probably put an extra argument independence = True or independence = False.
-- Robert Kern
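For reference, one way to compose that computation from existing functionality might look like this (an illustrative sketch, not an official recipe):

```python
import numpy as np

rng = np.random.default_rng(42)

items = np.array(["a", "b", "c", "d"])
p = np.array([0.9, 0.1, 0.5, 1.0])   # independent per-element probabilities

# One Bernoulli trial per element; keep the successes.
mask = rng.random(len(items)) < p
selected = items[mask]
```

Each element is kept or dropped independently, so the result can be anything from the empty array up to the full list, as the proposal describes.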
[Numpy-discussion] Re: seed for new random number generator
On Tue, Jun 21, 2022 at 9:54 AM Pieter Eendebak wrote:
> Hi Robert,
>
> Thanks for the information and the offer. My use case is development of algorithms where I want to be able to test variations of the algorithm while keeping the same (simulated or generated) data. Related to that are example scripts part of the documentation that I want to always have the same result.
>
> Creating a generator and passing that (someway or the other) is a good approach, but it requires some refactoring of the code. Especially for third-party code I would like to avoid this (to be honest, most external code is still using the old numpy random number interface, so there we can use the old style of setting the seed)

Ah, okay, so you're replacing the old global with a new global Generator instance in your own code as an easier transitional step.

    # global instance
    rng = np.random.default_rng()

    def seed_global_rng(seed=None):
        # Create the same type of BitGenerator with the given seed,
        # and then copy its state over to the global.
        bg_cls = type(rng.bit_generator)
        bg = bg_cls(seed)
        rng.bit_generator.state = bg.state

I do recommend moving to passing around a Generator when you can; even in the older system, we've always recommended passing around a RandomState instance.

-- Robert Kern
[Numpy-discussion] Re: seed for new random number generator
On Sun, Jun 19, 2022 at 9:37 AM Pieter Eendebak wrote:
> Hi everyone,
>
> The new numpy random interface (e.g. r=numpy.random.default_rng; r.random) is much faster than the old one (e.g. np.random.random). When converting code from the old style to the new style I miss having a way to set the seed of the RNG.
>
> I tried:
>
> rng.bit_generator = np.random.PCG64(seed=42)     # fails, with good error message
> rng.bit_generator.state['state']['state'] = 42   # has no effect, perhaps make this dict read-only?
>
> Is there a way to set the seed without creating a new RNG object?

We generally recommend just creating a new Generator and passing that around in almost all cases. Whenever that can possibly be made to work, please do that. The use of np.random.seed() is usually a code smell indicating that someone was working around the fact that there was just the one global underneath np.random.random() et al. When you don't have the constraint of a single global instance, it's almost always going to be better to use multiple instances; you don't need that workaround anymore.

There are ways to punch in a new BitGenerator into an existing Generator, if you must, and also ways to punch in a state dict into the BitGenerator, but these are largely for various meta-programming tasks like serialization rather than day-to-day use of pseudorandom numbers. If you are converting old code that happened to use np.random.seed(), it is almost certainly not one of these tasks. If you want to describe your use case more specifically, I can give you some more guidance on good patterns to replace it with the new system.

-- Robert Kern
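A minimal sketch of the recommended pattern of passing a Generator around (function and variable names are illustrative):

```python
import numpy as np

def simulate(n, rng):
    # Take the Generator as an explicit argument rather than
    # seeding a hidden global.
    return rng.normal(size=n)

rng = np.random.default_rng(12345)       # seed once, at the top level
a = simulate(3, rng)

# Re-running with a fresh Generator and the same seed reproduces the data.
b = simulate(3, np.random.default_rng(12345))
```

Reproducibility comes from constructing the Generator with a known seed, not from mutating a shared global state.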
[Numpy-discussion] Re: ndarray shape permutation
On Tue, May 17, 2022 at 10:36 PM Joseph Fox-Rabinovitz <jfoxrabinov...@gmail.com> wrote:
> You could easily write an extension to ndarray that maps axis names to indices and vice versa.

Indeed. There have been several over the years. The current most-maintained one is xarray, near as I can tell. https://xarray.pydata.org/en/stable/index.html

-- Robert Kern
[Numpy-discussion] Re: ENH: Support Random Unit Vector
On Tue, May 17, 2022 at 5:20 AM wrote:
> ### Proposed new feature or change:
>
> Description
>
> Often you want to randomize a "direction" that doesn't have "size". This can be useful when:
> * You want to randomize the direction of an NPC walk developing in a 2D or 3D game, where its speed is a predefined constant.
> * You're developing a particle simulation with all particles moving in random directions
> * etc.
>
> This random direction should be an `n`th dimensional unit vector which is randomized **uniformly** from the unit sphere. That means that sections of the unit sphere with equal areas should have the same chance of getting a vector from.
>
> Implementation
> ===
>
> If one wants to randomize an `n`th dimensional direction uniformly, one must do the following:
> 1. Generate `n` components of the vector using the standard normal distribution
> 2. Find the norm of the generated vector
> 3. Divide the vector by its norm
>
> In simple Python code:
> ```python
> import numpy as np
>
> def sphere_uniform(n: int):
>     vec = np.random.normal(size=n)
>     norm = np.sqrt(np.sum(vec ** 2))
>     return vec / norm
> ```
>
> This implementation suggestion is based on [this blog post](https://towardsdatascience.com/the-best-way-to-pick-a-unit-vector-7bd0cc54f9b). I'm not sure if there's a faster way of implementing it using C++ and Cython, but I'm confident that someone here will know.

I would probably use np.linalg.norm() to calculate the norm, but mostly for readability.

> Why Implement It in Numpy?
> ===
>
> I believe that random unit vectors are common enough to be a part of Numpy. Scientists, gamers and engineers use random directions all the time, and I believe there is no reason why Numpy won't provide this ability.

As a general policy, we do not plan on adding more methods to Generator (c.f. discussion of a previous suggested addition[1]). This would be a good multivariate distribution to add to scipy.stats, though.
[1] https://www.mail-archive.com/numpy-discussion@python.org/msg04724.html

-- Robert Kern
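For what it's worth, the np.linalg.norm() variant of the poster's function might look like this (a sketch; the optional `rng` parameter is an addition for reproducibility, not part of the original proposal):

```python
import numpy as np

def sphere_uniform(n, rng=None):
    # Gaussian components give a direction uniform on the unit sphere.
    rng = np.random.default_rng() if rng is None else rng
    vec = rng.standard_normal(n)
    return vec / np.linalg.norm(vec)

v = sphere_uniform(3, np.random.default_rng(0))
```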
[Numpy-discussion] Re: Discussion: History & Possible Deprecation/Removal of numpy.linalg.{eig(), eigvals()}?
On Tue, May 3, 2022 at 8:40 PM Leo Fang wrote:
> Hi, I am catching up with an assigned task that I slipped. I am wondering a few things about the numpy.linalg.eig() API (eigvals() is also included in this discussion, but for simplicity let's focus on eig()). AFAIU the purpose of eig() is for handling general eigen-systems which have left and right eigenvectors (that could differ beyond transpose/conjugate). For this, I have two questions:
>
> 1. What's the history of NumPy adding this eig() routine? My guess would be NumPy added it because Matlab had it, but I couldn't tell from a quick commit search.

It essentially dates back to Numeric, though the name was a bit different. But the core feature of only computing right-eigenvectors was there. scipy.linalg.eig() came along and exposed both left- and right-eigenvectors (with close to the present signature, I think). Then numpy took over from Numeric and adopted the scipy.linalg names for the routines, but cut out a number of the features that made some of the interfaces complicated. Since Numeric only did right-eigenvectors (what most people want, even in the general-matrix case), so did numpy.

> 2. Would it be possible to deprecate/remove this API? From past discussions with users and developers, I found:
> - numpy.linalg.eig() only returns the right eigenvectors, so it's not very useful:
>   * When the left eigenvectors are not needed --- which is the majority of the use cases --- eigh() is the right routine, but most users (especially those transitioning from Matlab) don't know it's there

eigh() is only for symmetric/Hermitian matrices.

>   * When the left eigenvectors are needed, users have to use scipy.linalg.eig() anyway

It seems to me that you are assuming that in the non-Hermitian case, the right-eigenvectors are useless without the left-eigenvectors. It isn't obvious to me why that would be the case.
It is true that the left- and right-eigenvectors differ from each other, but it's not clear to me that in every application with a general matrix I need to compute both.

> - A general support for eig() could be challenging (numerical issues), and doing SVDs instead is usually the better way out (in terms of both numerical stability and mathematical sense).

Computing the left-eigenvectors is no real problem. The underlying _GEEV LAPACK routine can do it if asked. It's the same routine under scipy.linalg.eig().

-- Robert Kern
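For example, scipy.linalg.eig() can return the left eigenvectors on request via its `left` flag (a small illustrative check on a made-up matrix):

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

# left=True asks the underlying *GEEV routine for the left eigenvectors too.
w, vl, vr = eig(A, left=True, right=True)

# Right eigenvectors: A @ vr[:, i] == w[i] * vr[:, i]
right_ok = np.allclose(A @ vr, vr * w)

# Left eigenvectors: vl[:, i].conj().T @ A == w[i] * vl[:, i].conj().T
left_ok = np.allclose(vl.conj().T @ A, w[:, None] * vl.conj().T)
```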
[Numpy-discussion] Re: Numpy array
On Thu, Jan 27, 2022 at 8:18 AM wrote:
> Hi, I am new to numpy. This is the first time I am using numpy.
>
> https://github.com/mspieg/dynamical-systems/blob/master/Bifurcations.ipynb
>
> This code I found to create a bifurcation graph. There is a section of code which I am not able to understand:
>
> vr[stable], vx[stable]
>
> vr and vx are arrays of 150 elements. The unstable and stable arrays also have 150 elements.
>
> But vr[stable], vx[stable] becomes 75 elements in the array, and I am not able to understand how 150 elements in the array drops to 75.

`stable` is a boolean array which is True where `f_x(vr, vx) < 0` and False otherwise. When a boolean array is used as an index, the resulting array only has the elements where the boolean value in the index array is True, dropping all of the elements where it is False. That is why the number of elements changes.

https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing

-- Robert Kern
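A small demonstration of boolean-mask indexing (illustrative values, not the notebook's actual data):

```python
import numpy as np

vr = np.arange(6)
stable = np.array([True, False, True, True, False, False])

# Boolean indexing keeps only the positions where the mask is True,
# so the result is shorter than the input.
kept = vr[stable]   # -> array([0, 2, 3])
```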
[Numpy-discussion] Re: Fit 2D points to closed curve
On Thu, Jan 20, 2022 at 1:15 PM wrote: > Hi, > I use a CAD package called Rhino which lets users write python scripts > that run in the cad environment. I have a closed curve that is a design > surface border and I have a sparse data set of 2D surveyed points taken > around that border, so some error is included in the 2D data set. I would > like to know what python function I could use to do a 2D best fit of the > points to the curve. > Plenty of examples of fitting curves to points but not of best fitting > points to a closed curve of which I do not know the formula. > You can use a parametric spline (where X and Y are each functions of a "time" coordinate `t`) with periodic boundary conditions. See this StackOverflow answer for an example using scipy.interpolate: https://stackoverflow.com/a/31466013/163633 Whether that spline representation can be converted to Rhino NURBS, or whether Rhino itself can fit a periodic parametric NURBS directly, I don't know. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
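A sketch of the periodic parametric spline fit suggested above, in the style of the linked StackOverflow answer. The noisy circle is made-up stand-in data for the surveyed border points, and the smoothing factor `s` is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Hypothetical survey data: noisy 2D points around a closed curve (a circle).
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
rng = np.random.default_rng(0)
x = np.cos(theta) + rng.normal(scale=0.01, size=theta.size)
y = np.sin(theta) + rng.normal(scale=0.01, size=theta.size)

# Close the loop by repeating the first point, then fit with periodic
# boundary conditions (per=1); s > 0 allows smoothing of the survey error.
x = np.r_[x, x[0]]
y = np.r_[y, y[0]]
tck, u = splprep([x, y], s=0.01, per=1)

# Evaluate the fitted closed curve on a fine parameter grid.
xi, yi = splev(np.linspace(0, 1, 200), tck)
# Because the boundary conditions are periodic, the curve closes on itself.
print(np.allclose([xi[0], yi[0]], [xi[-1], yi[-1]]))
```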
[Numpy-discussion] Re: np.convolve question
On Thu, Jan 13, 2022 at 10:47 AM wrote: > Hello, > > I am a mechanical engineer and am using Numpy Convolve to smoothen 1D data > collected from physical testing for a diesel engine. (I found convolve the > best after comparing with other smoothing methodologies) > I tried reading the numpy manual and the wikipedia article referenced in > it. > But I would specifically like to ask/discuss what would happen if the > second 1D input is not given? The second input is required, which is evident from the docstring and could be ascertained by just trying it. Are you instead asking whether np.convolve() could be modified to not require the second argument? If so, no, I don't think we would. There are no sensible defaults. Convolution is inherently a very general operation that combines two inputs to achieve a variety of effects, not just smoothing. > what type of window will the function pass the data through to smoothen it? There are several ways to build a low-pass (i.e. smoothing) filter. My favorite first choice is the Savitzky-Golay filter, as mentioned several times in the StackOverflow answers that you linked to. It constructs a low-pass filter window (i.e. the second 1D input) according to some simple, understandable design rules and then uses convolution. I recommend starting with `scipy.signal.savgol_filter()`. If you need to reuse the same filter window for lots of signals, you can use `scipy.signal.savgol_coeffs()` to get that window first, and then use `np.convolve()` explicitly. > Here are the articles I have read to implement my smoothing code: > References: > 1. https://numpy.org/doc/stable/reference/generated/numpy.convolve.html > 2. https://en.wikipedia.org/wiki/Convolution > 3. 
> https://stackoverflow.com/questions/20618804/how-to-smooth-a-curve-in-the-right-way -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
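A short sketch of both routes mentioned above — the one-shot `savgol_filter()` call, and building the window once with `savgol_coeffs()` and convolving explicitly. The signal here is synthetic stand-in data, and the window length and polynomial order are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

# Noisy 1D signal standing in for the engine test data.
rng = np.random.default_rng(42)
t = np.linspace(0, 1, 200)
noisy = np.sin(2 * np.pi * 3 * t) + rng.normal(scale=0.1, size=t.size)

# One-shot smoothing: 21-sample window, local cubic fits.
smooth = savgol_filter(noisy, window_length=21, polyorder=3)

# Reusable route: build the filter window once, then convolve explicitly.
window = savgol_coeffs(21, 3)
smooth2 = np.convolve(noisy, window, mode='same')

# Away from the edges (where the boundary handling differs) the two agree.
assert np.allclose(smooth[20:-20], smooth2[20:-20])
```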
[Numpy-discussion] Re: representation of valid float type range
On Thu, Jan 6, 2022 at 4:43 PM wrote: > Thanks for your answer. > > I think i understand it - is it that `f64_info.max - f64_info.min` does > not fit in float64? because it approximates `2 * f64_info.max`? > Well, it "fits" into a float64 by becoming `inf`, which is a valid float64 value, just one that has particular consequences. In that case, I agree with Klaus, linspace should be able to handle this? > I don't particularly agree that linspace() ought to add special-case logic to handle this. It would be difficult to write logic that reliably recognizes all the ways that something like this is actually the case and then does something sensible to recover from it. Lev showed how you can manually do something sorta reasonable for `np.linspace(f64_info.min, f64_info.max, num)` because the endpoints are symmetric around 0, but there is nothing that can really be done for the asymmetric case. Floating point arithmetic has its limits, and playing with values near the boundaries is going to make you run into those limits. I would rather have linspace() do something consistent that gives non-useful results in these boundary edge cases than try to do a bunch of different logic in these outer limits. The utility of getting "sensible" results in these extreme cases is fairly limited (not least because any downstream computation will also be very likely to generate NaNs and infs), so I would happily trade that to have a more consistent mental model about what linspace() does. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
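The overflow under discussion is easy to demonstrate directly — the span of the float64 range itself does not fit in float64:

```python
import numpy as np

f64_info = np.finfo(np.float64)

# max - min is approximately 2 * max, which overflows float64 to inf;
# that inf then propagates through any step computation built on it.
span = f64_info.max - f64_info.min
print(span)               # inf
print(np.isinf(span))     # True
```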
[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations
On Thu, Nov 25, 2021 at 11:58 PM Qianqian Fang wrote: > from my limited experience, as long as the "TypeError" from json keeps > popping up, requests like this ( #12481, #16432, #18994, > pallets/flask#4012, openmm/openmm#3202, zarr-developers/zarr-python#354) > will unlikely to cease (and maintainers will have to keep on closing with > "wontfix") - after all, no matter how different the format expectations a > user may have, seeing some sort of default behavior is still a lot more > satisfying than seeing an error. > I agree. Unfortunately, we have no control over that inside of numpy. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations
On Thu, Nov 25, 2021 at 10:21 PM Qianqian Fang wrote: > On 11/25/21 17:05, Stephan Hoyer wrote: > > Hi Qianqian, > > What is your concrete proposal for NumPy here? > > Are you suggesting new methods or functions like to_json/from_json in > NumPy itself? > > > that would work - either define a subclass of JSONEncoder to serialize > ndarray and allow users to pass it to cls in json.dump, or, as you > mentioned, define to_json/from_json like pandas DataFrame > <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html> > would save people from writing customized codes/formats. > > I am also wondering if there is a more automated way to tell > json.dump/dumps to use a default serializer for ndarray without using > cls=...? I saw a SO post mentioned about a method called "__serialize__" in > a class, but can't find it in the official doc. I am wondering if anyone is > aware of the method defining a default json serializer in an object? > There isn't one. You have to explicitly provide the JSONEncoder. Which is why there is nothing that we can really do in numpy to avoid the TypeError that you mention below. The stdlib json module just doesn't give us the hooks to be able to do that. We can provide top-level functions like to_json()/from_json() to encode/decode a top-level ndarray to a JSON text, but that doesn't help with ndarrays in dicts or other objects. We could also provide a JSONEncoder/JSONDecoder pair, too, but as I mention in one of the Github issues you link to, there are a number of different expectations that people could have for what the JSON representation of an array is. Some will want to use the JData standard. Others might just want the arrays to be represented as lists of lists of plain-old JSON numbers in order to talk with software in other languages that have no particular standard for array data. > As far as I can tell, reading/writing in your custom JSON format already > works with your jdata library. 
> > ideally, I was hoping the small jdata encoder/decoder functions can be > integrated into numpy; it can help avoid the "TypeError: Object of type > ndarray is not JSON serializable" in json.dump/dumps without needing > additional modules; more importantly, it simplifies users experience in > exchanging complex arrays (complex valued, sparse, special shapes) with > other programming environments. > It seems to me that the jdata package is the right place for implementing the JData standard. I'm happy for our documentation to point to it in all the places that we talk about serialization of arrays. If the json module did have some way for us to specify a default representation for our objects, then that would be a different matter. But for the present circumstances, I'm not seeing a substantial benefit to moving this code inside of numpy. Outside of numpy, you can evolve the JData standard at its own pace. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
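For readers who just want the "lists of plain JSON numbers" behavior today, here is a minimal sketch of the kind of `JSONEncoder` subclass discussed above (the class name is made up for illustration):

```python
import json
import numpy as np

class NDArrayEncoder(json.JSONEncoder):
    """Hypothetical encoder: render ndarrays as nested plain lists."""
    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)

payload = {"weights": np.array([[1.0, 2.0], [3.0, 4.0]])}
text = json.dumps(payload, cls=NDArrayEncoder)
print(text)  # {"weights": [[1.0, 2.0], [3.0, 4.0]]}
```

Note that `cls=NDArrayEncoder` must be passed on every `json.dumps()` call — which is exactly the limitation described above: the stdlib json module offers no hook for numpy to register a default representation globally.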
[Numpy-discussion] Re: dtype=(bool) vs dtype=bool
On Tue, Oct 19, 2021 at 8:54 PM Hongyi Zhao wrote: > On Wed, Oct 20, 2021 at 8:29 AM Robert Kern wrote: > > > > On Tue, Oct 19, 2021 at 8:22 PM wrote: > >> > >> Do I have to use it this way? > > > > > > Nothing is forcing you to, but everyone else will write it as > `dtype=bool`, not `dtype=(bool)`. `dtype=(bool)` is perfectly > syntactically-valid Python. It's just not idiomatic, so readers of your > code will wonder why you wrote it that way and if you meant something else > and will have trouble reading your code. > > I use Emacs, and find that python-lsp-server will give the dtype=() as > the completion result, see here [1] for more detailed discussion. > > [1] https://github.com/python-lsp/python-lsp-server/issues/98 Well, that's not helpful of python-lsp-server. Is it looking at our type annotations? The annotations for `dtype=` arguments are likely quite complicated because we accept so many different types here, but `()` is not a great completion. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: dtype=(bool) vs dtype=bool
On Tue, Oct 19, 2021 at 8:22 PM wrote: > Do I have to use it this way? > Nothing is forcing you to, but everyone else will write it as `dtype=bool`, not `dtype=(bool)`. `dtype=(bool)` is perfectly syntactically-valid Python. It's just not idiomatic, so readers of your code will wonder why you wrote it that way and if you meant something else and will have trouble reading your code. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 5:17 AM Hongyi Zhao wrote: > > > That’s just the way Python’s syntax works. Operators are not names that > can be resolved to objects that can be compared with the `is` operator. > Instead, when that operator is evaluated in an expression, the Python > interpreter will look up a specially-named method on the operand object (in > this case `__invert__`). Numpy array objects implement this method using > `np.invert`. > > If so, which is symlink to which, I mean, which is the original name, > and which is an alias? > "symlink" and "alias" are probably not the best analogies. The implementation of `np.ndarray.__invert__` simply calls `np.invert` to do the actual computation. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 1:09 AM wrote: > Thank you for pointing this out. This is the code block which includes the > first appearance of the keyword `logical_not`. > > BTW, why can't the ~ operator be tested equal to 'np.invert', as shown > below: > > ``` > In [1]: import numpy as np > In [3]: np.invert is np.bitwise_not > Out[3]: True > > In [4]: np.invert is ~ > File "", line 1 > np.invert is ~ > ^ > SyntaxError: invalid syntax > ``` > That’s just the way Python’s syntax works. Operators are not names that can be resolved to objects that can be compared with the `is` operator. Instead, when that operator is evaluated in an expression, the Python interpreter will look up a specially-named method on the operand object (in this case `__invert__`). Numpy array objects implement this method using `np.invert`. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
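A small demonstration of the dispatch described above — `~a` cannot be compared to `np.invert` by name, but its result is exactly what `a.__invert__()` (and hence `np.invert(a)`) produces:

```python
import numpy as np

a = np.array([0, 1, 2], dtype=np.uint8)

# `~a` is not a name lookup; Python evaluates it by calling a.__invert__(),
# which numpy arrays implement in terms of np.invert.
assert np.array_equal(~a, np.invert(a))
assert np.array_equal(~a, a.__invert__())
print(~a)  # [255 254 253]
```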
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 12:09 AM wrote: > > (the bool implementation uses the `logical_not` loop). > > Do you the following code snippet: > > > https://github.com/numpy/numpy/blob/3c1e9b4717b2eb33a2bf2d495006bc300f5b8765/numpy/core/src/umath/loops.c.src#L1627-L1633 This is the one that gets expaneded to `BOOL_logical_not`: https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops.c.src#L493-L510 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Sun, Oct 3, 2021 at 9:27 PM wrote: > > So, C/Python operator `~` has been overridden by the corresponding user > function in numpy, but where is the corresponding source code > implementation? > ufuncs are implemented in C. We provide so-called loop functions that iterate over contiguous segments of the operand arrays. We use a custom code generator to make implementations for all of the types. Here are the ones that correspond to `np.invert`. (the bool implementation uses the `logical_not` loop). https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops.c.src#L612-L626 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: spam on the mailing lists
On Wed, Sep 29, 2021 at 3:35 AM Andras Deak wrote: > Hi All, > > Today both of the python.org mailing lists I'm subscribed to (numpy and > scipy-dev) got the same kind of link shortener spam. I assume all the > mailing lists started getting these, and that these won't go away for a > while. > Given that we've had a literal order of magnitude more messages about the spam than the spam itself, maybe it's just a blip? I will suggest that spam management is probably not a strong, much less a decisive, argument for migrating to a new discussion forum. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Running numpy.test() after pip install
On Wed, Sep 29, 2021 at 3:50 PM Ralf Gommers wrote: > > On Tue, Sep 28, 2021 at 10:39 PM Aaron Meurer wrote: > >> Could numpy include a variant (I'm not sure what setuptools calls >> this) so that 'pip install numpy[tests]' installs those extra >> dependencies? >> > > I'd prefer not to, those things aren't great for maintenance/testing, and > if you have to read the docs to know how to spell `numpy[tests]` you may > just as well write it `pip install numpy pytest hypothesis`. There's also > no one set of extra dependencies people will want - what about running > benchmarks, optional test dependencies, etc.? > I think Bennet's request to add this to the `long_description` that shows up on PyPI is eminently reasonable, though. >> If adding them as dependencies seems too heavyweight, or is otherwise >> deemed undesirable, perhaps just a note in the Project Description at >> https://pypi.org/project/numpy/ to say that, if you want to run tests, >> those two packages will be needed? >> >> Thanks, -- bennet > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
Re: [Numpy-discussion] SeedSequence.spawn()
On Sun, Aug 29, 2021 at 6:58 AM Stig Korsnes wrote: > Thanks again Robert! > Got rid of dict(state). > > Not sure I followed you completely on the test case. > In the code that you showed, you were pulling out and storing the `.state` dict and then punching that back into a single `Generator` instance. Instead, you can just make the ~200-1000 `Generator` instances. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
On Sat, Aug 28, 2021 at 5:56 AM Stig Korsnes wrote: > Thank you again Robert. > I am using NamedTuple for mye keys, which also are keys in a dictionary. > Each key will be unique (tuple on distinct int and enum), so I am thinking > maybe the risk of producing duplicate hash is not present, but could as > always be wrong :) > Present, but possibly ignorably small. 128-bit spaces give enough breathing room for me to be comfortable; 64-bit spaces like what hash() will use for its results makes me just a little claustrophobic. If the structure of the keys is pretty fixed, just these two integers (counting the enum as an integer), then I might just use both in the seeding material. def get_key_seed(key:ComponentId, root_seed:int): return np.random.SeedSequence([key.the_int, int(key.the_enum), root_seed]) > For positive ints i followed this tip > https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function > , and did: > > def stronghash(key:ComponentId): > return ctypes.c_size_t(hash(key)).value > np.uint64(possibly_negative_integer) will also work for this purpose (somewhat more reliably). Since I will be using each process/random sample several times, and keeping > all of them in memory at once is not feasible (dimensionality) i did the > following: > > self._rng = default_rng(cs) > self._state = dict(self._rng.bit_generator.state) # > > def scenarios(self) -> npt.NDArray[np.float64]: > self._rng.bit_generator.state = self._state > > return > > Would you consider this bad practice, or an ok solution? > It's what that property is there for. No need to copy; `.state` creates a new dict each time. In a quick test, I measured a process with 1 million Generator instances to use ~1.5 GiB while 1 million state dicts ~1.0 GiB (including all of the other overhead of Python and numpy; not a scientific test). Storing just the BitGenerator is half-way in between. That's something, but not a huge win. 
If that is really crossing the border from feasible to infeasible, you may be about to run into your limits anyways for other reasons. So balance that out with the complications of swapping state in and out of a single instance. > In Norway we have a saying which directly translates: "He asked for the > finger... and took the whole arm." > Well, when I craft an overly-complicated system, I feel responsible to help shepherd people along in using it well. :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
joblib is a library that uses clever caching of function call results to make the development of certain kinds of data-heavy computational pipelines easier. In order to derive the key to be used to check the cache, joblib has to look at the arguments passed to the function, which may involve usually-nonhashable things like large numpy arrays. https://joblib.readthedocs.io/en/latest/ So they constructed joblib.hash() which basically takes the arguments, pickles them into a bytestring (with some implementation details), then computes an MD5 hash on that. It's probably overkill for your keys, but it's easily available and quite generic. It returns a hex-encoded string of the 128-bit MD5 hash. `int(..., 16)` will convert that to a non-negative (almost-certainly positive!) integer that can be fed into SeedSequence. On Fri, Aug 27, 2021 at 5:03 AM Stig Korsnes wrote: > Thank you Robert! > This scheme fits perfectly into what I`m trying to accomplish! :) The > "smooshing" of ints by supplying a list of ints had eluded me. Thank you > also for the pointer about built-in hash(). I would not be able to rely on > it anyways, because it does not return strictly positive ints which > SeedSequence requires. If you have a minute to spare: Could you briefly > explain "int(joblib.hash(key) > <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, > 16)" , and would this always return non-negative integers? > Thanks again! > > tor. 26. aug. 2021 kl. 22:59 skrev Robert Kern : > >> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes >> wrote: >> >>> Hi, >>> Is there a way to uniquely spawn child seeds? >>> I`m doing monte carlo analysis, where I have n random processes, each >>> with their own generator. >>> All process models instantiate a generator with default_rng(). I.e >>> ss=SeedSequence() cs=ss.Spawn(n), and using cs[i] for process i. 
Now, the >>> problem I`m facing, is that results using individual process depends on >>> the order of the process initialization ,and the number of processes used. >>> However, if I could spawn children with a unique identifier, I would be >>> able to reproduce my individual results without having to pickle/log >>> states. For example, all my models have an id (tuple) field which is >>> hashable. >>> If I had the ability to SeedSequence(x).Spawn([objects]) where objects >>> support hash(object), I would have reproducibility for all my processes. I >>> could do without the spawning, but then I would probably loose independence >>> when I do multiproc? Is there a way to achieve my goal in the current >>> version 1.21 of numpy? >>> >> >> I would probably not rely on `hash()` as it is only intended to be pretty >> good at getting distinct values from distinct inputs. If you can combine >> the tuple objects into a string of bytes in a reliable, collision-free way >> and use one of the cryptographic hashes to get them down to a 128bit >> number, that'd be ideal. `int(joblib.hash(key) >> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, >> 16)` should do nicely. You can combine that with your main process's seed >> easily. SeedSequence can take arbitrary amounts of integer data and smoosh >> them all together. The spawning functionality builds off of that, but you >> can also just manually pass in lists of integers. >> >> Let's call that function `stronghash()`. Let's call your main process >> seed number `seed` (this is the thing that the user can set on the >> command-line or something you get from `secrets.randbits(128)` if you need >> a fresh one). Let's call the unique tuple `key`. 
You can build the >> `SeedSequence` for each job according to the `key` like so: >> >> root_ss = SeedSequence(seed) >> for key, data in jobs: >> child_ss = SeedSequence([stronghash(key), seed]) >> submit_job(key, data, seed=child_ss) >> >> Now each job will get its own unique stream regardless of the order the >> job is assigned. When the user reruns it with the same root `seed`, they >> will get the same results. When the user chooses a different `seed`, they >> will get another set of results (this is why you don't want to just use >> `SeedSequence(stronghash(key))` all by itself). >> >> I put the job-specific seed data ahead of the main program's seed to be >> on the super-safe side. The spawning mechanism will append integers to the >> end, so there's a super-tiny chance somewhere down a long line of >> `root_ss.spawn()`s that there would be a collision (and I mean >> super-extra-tiny). But best practices cost nothing. >> >> I hope that helps and is not too confusing! >> >> -- >> Robert Kern >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes wrote: > Hi, > Is there a way to uniquely spawn child seeds? > I`m doing monte carlo analysis, where I have n random processes, each with > their own generator. > All process models instantiate a generator with default_rng(). I.e > ss=SeedSequence() cs=ss.Spawn(n), and using cs[i] for process i. Now, the > problem I`m facing, is that results using individual process depends on > the order of the process initialization ,and the number of processes used. > However, if I could spawn children with a unique identifier, I would be > able to reproduce my individual results without having to pickle/log > states. For example, all my models have an id (tuple) field which is > hashable. > If I had the ability to SeedSequence(x).Spawn([objects]) where objects > support hash(object), I would have reproducibility for all my processes. I > could do without the spawning, but then I would probably loose independence > when I do multiproc? Is there a way to achieve my goal in the current > version 1.21 of numpy? > I would probably not rely on `hash()` as it is only intended to be pretty good at getting distinct values from distinct inputs. If you can combine the tuple objects into a string of bytes in a reliable, collision-free way and use one of the cryptographic hashes to get them down to a 128bit number, that'd be ideal. `int(joblib.hash(key) <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, 16)` should do nicely. You can combine that with your main process's seed easily. SeedSequence can take arbitrary amounts of integer data and smoosh them all together. The spawning functionality builds off of that, but you can also just manually pass in lists of integers. Let's call that function `stronghash()`. Let's call your main process seed number `seed` (this is the thing that the user can set on the command-line or something you get from `secrets.randbits(128)` if you need a fresh one). Let's call the unique tuple `key`. 
You can build the `SeedSequence` for each job according to the `key` like so: root_ss = SeedSequence(seed) for key, data in jobs: child_ss = SeedSequence([stronghash(key), seed]) submit_job(key, data, seed=child_ss) Now each job will get its own unique stream regardless of the order the job is assigned. When the user reruns it with the same root `seed`, they will get the same results. When the user chooses a different `seed`, they will get another set of results (this is why you don't want to just use `SeedSequence(stronghash(key))` all by itself). I put the job-specific seed data ahead of the main program's seed to be on the super-safe side. The spawning mechanism will append integers to the end, so there's a super-tiny chance somewhere down a long line of `root_ss.spawn()`s that there would be a collision (and I mean super-extra-tiny). But best practices cost nothing. I hope that helps and is not too confusing! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
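The scheme above can be made concrete with a small runnable sketch. Here `stronghash()` is an illustrative stand-in for `joblib.hash()` built from the stdlib `hashlib` (it assumes `repr()` of the key is stable across runs, which holds for tuples of ints and strings); the seed and key values are made up:

```python
import hashlib
import numpy as np
from numpy.random import SeedSequence, default_rng

def stronghash(key):
    # Illustrative stand-in for joblib.hash(): digest the repr of the key
    # down to a non-negative 128-bit integer suitable for SeedSequence.
    return int(hashlib.md5(repr(key).encode("utf-8")).hexdigest(), 16)

seed = 12345           # the main process seed
key = (7, "pump")      # a unique, hashable model identifier

# Job-specific seed data first, then the main seed, as described above.
child_ss = SeedSequence([stronghash(key), seed])
rng = default_rng(child_ss)

# The same (key, seed) pair reproduces the same stream, regardless of
# the order in which jobs were spawned.
rng2 = default_rng(SeedSequence([stronghash(key), seed]))
assert np.array_equal(rng.standard_normal(3), rng2.standard_normal(3))
```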
Re: [Numpy-discussion] sinpi/cospi trigonometric functions
On Wed, Jul 14, 2021 at 12:26 PM Joshua Wilson wrote: > I'll note that SciPy actually does have `sinpi` and `cospi`-they just > happen to be private: > > https://github.com/scipy/scipy/blob/master/scipy/special/functions.json#L58 > https://github.com/scipy/scipy/blob/master/scipy/special/functions.json#L12 > > They are used extensively inside the module though as helpers in other > special functions and have extensive tests of their own: > > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L533 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L547 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L960 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L1741 > > I have no objections to making them public; the PR is as simple as > removing the underscores and adding a docstring. > Delightful! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] sinpi/cospi trigonometric functions
On Wed, Jul 14, 2021 at 11:39 AM Tom Programming wrote: > Hi all, > > (I am very new to this mail list so please cut me some slack) > > trigonometric functions like sin(x) are usually implemented as: > > 1. Some very complicated function that does bit twiddling and basically > computes the reminder of x by pi/2. An example in > http://www.netlib.org/fdlibm/e_rem_pio2.c (that calls > http://www.netlib.org/fdlibm/k_rem_pio2.c ). i.e. ~500 lines of branching > C code. The complexity arises in part because for big values of x the > subtraction becomes more and more ill defined, due to x being represented > in binary base to which an irrational number has to subtracted and > consecutive floating point values being more and more apart for higher > absolute values. > 2. A Taylor series for the small values of x, > 3. Plus some manipulation to get the correct branch, deal with subnormal > numbers, deal with -0, etc... > > If we used a function like sinpi(x) = sin(pi*x) part (1) can be greatly > simplified, since it becomes trivial to separate the reminder of the > division by pi/2. There are gains both in the accuracy and the performance. > > In large parts of the code anyways there is a pi inside the argument of > sin since it is common to compute something like sin(2*pi*f*t) etc. So I > wonder if it is feasible to implement those functions in numpy. > > To strengthen my argument I'll note that the IEEE standard, too, defines ( > https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf ) the functions > sinPi, cosPi, tanPi, atanPi, atan2Pi. 
And there are existing > implementations, for example, in Julia ( > https://github.com/JuliaLang/julia/blob/6aaedecc447e3d8226d5027fb13d0c3cbfbfea2a/base/special/trig.jl#L741-L745 > ) and the Boost C++ Math library ( > https://www.boost.org/doc/libs/1_54_0/boost/math/special_functions/sin_pi.hpp > ) > > And that issue caused by apparently inexact calculations have been raised > in the past in various forums ( > https://stackoverflow.com/questions/20903384/numpy-sinpi-returns-negative-value > https://stackoverflow.com/questions/51425732/how-to-make-sinpi-and-cospi-2-zero > https://www.reddit.com/r/Python/comments/2g99wa/why_does_python_not_make_sinpi_0_just_really/ > ... ) > > PS: to be nitpicky I see that most implementation implement sinpi as > sin(pi*x) for small values of x, i.e. they multiply x by pi and then use > the same coefficients for the Taylor series as the canonical sin. A > multiply instruction could be spared, in my opinion, by storing different > Taylor expansion number coefficients tailored for the sinpi function. It is > not clear to me if it is not done because the performance gain is small, > because I am wrong about something, or because those 6 constants of the > Taylor expansion have a "sacred aura" about them and nobody wants to enter > deeply into this. > The main value of the sinpi(x) formulation is that you can do the reduction on x more accurately than on pi*x (reduce-then-multiply rather than multiply-then-reduce) for people who particularly care about the special locations of half-integer x. sin() and cos() are often not implemented in software, but by CPU instructions, so you don't want to reimplement them. There is likely not a large accuracy win by removing the final multiplication. We do have sindg(), cosdg(), and tandg() in scipy.special that work similarly for inputs in degrees rather than radians. They also follow the reduce-then-multiply strategy. scipy.special would be a good place for sinpi() and friends. 
-- Robert Kern
Re: [Numpy-discussion] Interface for wrapping random distribution functions (__array_function__, __ufunc__, ?)
On Wed, Jun 30, 2021 at 3:34 PM Yoann Mocquin wrote: > > > Hello there, > > here is a feature request for the possibility for numpy random > distributions to be wrapped by a mechanism like __array_function__ or > __array_ufunc__. You can find the GH issue at: > https://github.com/numpy/numpy/issues/19382. > > > > This post follows [this one](https://github.com/numpy/numpy/issues/18902). > I figured I should open another issue for `np.random.normal`, but this > applies for all distributions I guess. > > ## Feature > > Basically, I would like `numpy.random.*` distributions to trigger > an interface when passed instances of custom classes, "à la" > `__array_ufunc__` or `__array_function__`. Here is an example concept : > The main problem is that these are not functions but methods on a hidden global `RandomState` instance. I haven't kept up fully with all of the efforts in the `__array_*__` proposals, but I do see that methods are explicitly excluded from the `__array_function__` mechanism: https://numpy.org/neps/nep-0018-array-function-protocol.html#non-goals The `RandomState` functionality (and thus all of these aliases in `numpy.random`) is now frozen (per NEP 19), so we will not be adding this functionality to them. If the `__array_function__` developments eventually work out a good way to wrap methods, then we can think about using that on `Generator`, but I suspect that will not be straightforward. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
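The exclusion of methods can be demonstrated with a minimal wrapper class (the class here is hypothetical, for illustration): free functions covered by NEP 18 dispatch to `__array_function__`, while `np.random.normal`, being a bound method on the hidden `RandomState`, never sees the protocol and just coerces its argument through `__array__`.

```python
import numpy as np

class MyArray:
    """Hypothetical duck array implementing the NEP 18 protocol."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # intercept NEP 18-dispatched numpy functions
        return f"dispatched {func.__name__}"

    def __array__(self, dtype=None, copy=None):
        # fallback coercion path used by non-dispatched code
        return np.asarray(self.data, dtype=dtype)

# A free function dispatches through the protocol:
print(np.concatenate([MyArray([1.0, 2.0])]))    # -> dispatched concatenate

# np.random.normal is a method on a hidden RandomState: no dispatch;
# the wrapper is silently coerced to a plain ndarray via __array__.
out = np.random.normal(loc=MyArray([0.0, 0.0]))
print(type(out))                                # -> <class 'numpy.ndarray'>
```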
Re: [Numpy-discussion] ENH: Discussions about 'add numpy.topk'
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi wrote: > What does k stand for here? As someone that never encountered this > function before I find both names equally confusing. If I understand > what the function is supposed to be doing, I think largest() would be > much more descriptive. > `k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function. https://docs.python.org/3/library/heapq.html#heapq.nlargest "top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries. It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. 
There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
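For reference, a top-k along the lines discussed can be sketched with `np.argpartition`; the function name and signature here are assumptions for illustration, not a numpy API.

```python
import numpy as np

def topk(a, k):
    # Hypothetical helper: the k largest values of a 1-D array, descending.
    idx = np.argpartition(a, -k)[-k:]   # indices of the k largest, unordered
    return np.sort(a[idx])[::-1]        # order them largest-first

scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
print(topk(scores, 2))                  # -> [0.9 0.7]
```

In the ML evaluation setting described above, you would typically keep the indices (`idx`) rather than the values, to check whether the correct class is among the model's top k predictions.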
Re: [Numpy-discussion] Indexing question
On Thu, May 20, 2021 at 1:40 PM CJ Carey wrote: > Or as a one-liner: > > out[np.arange(len(x)), x] = 1 > Ah, right. `x[arange(len(x))]` is a no-op. > If NEP 21 is accepted ( > https://numpy.org/neps/nep-0021-advanced-indexing.html) this would be > even simpler: > > out.vindex[:, x] = 1 > > Was there ever a decision about that NEP? I didn't follow the discussion > too closely at the time. > IIRC, I think there was broad agreement on the final plan as stated in the NEP. I suspect, though, that the general appetite for adding to the array API surface has declined even from its anemic starting point, now that deep learning frameworks with ndarray-mimicking APIs have taken off. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Indexing question
On Thu, May 20, 2021 at 9:47 AM Neal Becker wrote: > This seems like something that can be done with indexing, but I > haven't found the solution. > > out is a 2D array initialized to zeros. x is a 1D array whose > values correspond to the columns of out. For each row in out, set > out[row, x[row]] = 1. Here is working code: > def orthogonal_mod(x, nbits): > out = np.zeros((len(x), 1 << nbits)) > for e in range(len(x)): > out[e, x[e]] = 1 > return out > > Any idea to do this without an explicit python loop? > i = np.arange(len(x)); j = x[i]; out[i, j] = 1 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
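A runnable sketch of the vectorized version (sample values chosen for illustration): each row index is paired elementwise with the corresponding column index from x, so a single fancy-indexing assignment replaces the loop.

```python
import numpy as np

x = np.array([2, 0, 1, 3])              # column to set in each row
nbits = 2
out = np.zeros((len(x), 1 << nbits))
out[np.arange(len(x)), x] = 1           # row e gets a 1 in column x[e]
print(out)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]
```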
Re: [Numpy-discussion] two questions about `choose`
On Mon, Apr 19, 2021 at 12:24 PM bas van beek wrote: > After taking a closer look at the `np.choose` docs I think we should > change `choices: npt.ArrayLike` into `choices: Sequence[npt.ArrayLike]`. > > This would resolve the issue, but it'd also mean that directly passing an > array will be prohibited (as far as type checkers are concerned). > > The docs do mention that the outermost container should be a list or > tuple anyway, so I'm not convinced that this new typing restriction would > be a huge loss. > Why not allow both? -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] two questions about `choose`
On Sat, Apr 17, 2021 at 4:28 PM Kevin Sheppard wrote: > 1. I suppose it only uses the (Native int or int64) dtype since each one > would need a code path to run quickly. > > 2. I would describe this as a bug. I think sequences are converted to > arrays and in this case the conversion is not returning a 2 element object > array but expanding and then concatenating. > No, it's broadcasting the two to a common shape, as documented. https://numpy.org/doc/stable/reference/generated/numpy.choose.html > E.g., >> a = (0,1,1,0,0,0,1,1) # binary array >> np.choose(a, (0, range(8))) # array([0, 1, 2, 0, 0, 0, 6, 7]) >> > This is equivalent to `np.choose(a, (np.zeros(8, dtype=int), range(8)))` -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
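Kevin's example, made runnable: `choose` broadcasts the scalar `0` against `range(8)` to a common `(8,)` shape rather than treating the pair as a 2-element object array, so the selector picks elementwise between a broadcast zero and `arange(8)`.

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 0, 1, 1])      # selector: which choice array to draw from
result = np.choose(a, (0, np.arange(8)))    # the scalar 0 is broadcast to shape (8,)
print(result)                               # -> [0 1 2 0 0 0 6 7]

# the same thing with the broadcast written out explicitly:
equiv = np.choose(a, (np.zeros(8, dtype=int), np.arange(8)))
print(np.array_equal(result, equiv))        # -> True
```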
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 6:16 PM Andrew Nelson wrote: > On Thu, 15 Apr 2021 at 07:15, Robert Kern wrote: > >> On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke >> wrote: >> >>> Regarding numpy, I'd propose a bolder measure: >>> To let savetxt(fname, X, ...) store exactly the same information in >>> compressed and uncompressed files, always invoke gzip with mtime = 0. >>> >> >> I agree. >> > > I might look into making a PR for this. To be clear what would the desired > functionality be: > > 1. Mandatory to have mtime = 0? > > 2. Default mtime = 0, but `np.save*` has an extra `mtime` kwd that allows > to set the mtime? > > 3. Default mtime = time.time(), but `np.save*` has an extra `mtime` kwd > that allows to set the mtime = 0? > > > As Joachim says for testing/git-related purposes it is nice to have > bit-wise unchanged files produced (such that the file-hash is unchanged), > but I can also see that it might be nice to have a modification time when > files contained in a zip file were last changed (e.g. write with numpy, > check/inspect with a different module). Of course with the latter you could > just look at the overall file-write date, they should be the same. > I suspect no one's actually looking at the timestamp inside the file (relevant XKCD comic[1] notwithstanding). I'd lean towards boldness here and just set mtime=0. If someone screams, then we can add in the option for them to get the old functionality. But moving closer to reliably reproducible file formats is a goal worth changing the default behavior. [1] https://xkcd.com/1172/ -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
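A sketch of what the mtime=0 behavior would look like from user code; the helper name and keyword are hypothetical, not the numpy API. Opening the gzip stream yourself with `mtime=0` (and an empty embedded filename) and handing the text wrapper to `savetxt` makes repeated runs produce byte-identical files.

```python
import gzip
import io
import numpy as np

def savetxt_gz(fname, X, mtime=0):
    # Hypothetical helper: gzip with a fixed mtime (and no embedded
    # filename) so identical data yields identical bytes on disk.
    with open(fname, "wb") as raw:
        gz = gzip.GzipFile(fileobj=raw, mode="wb", filename="", mtime=mtime)
        with io.TextIOWrapper(gz, encoding="utf-8", newline="\n") as txt:
            np.savetxt(txt, X)
```

Two calls with the same data then hash identically, which is exactly what file-hash-based workflow systems like DVC need.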
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 6:20 PM Derek Homeier < de...@astro.physik.uni-goettingen.de> wrote: > On 14 Apr 2021, at 11:15 pm, Robert Kern wrote: > > > > On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke > wrote: > > Regarding numpy, I'd propose a bolder measure: > > To let savetxt(fname, X, ...) store exactly the same information in > > compressed and uncompressed files, always invoke gzip with mtime = 0. > > > > I agree. > > > I would caution though that relying on the checksum or similar of the > compressed data still > does not seem a very robust check of the data itself - the compressed file > would still definitely > change with any change in compression level, and quite possibly with > changes in the > linked compression library (perhaps even a simple version update). > Sure, but in practice, that variability happens more rarely. In particular, I'm not going to see that variability when I'm re-running the same workflow with the same software versions inside of a system like DVC[1] that uses file hashes to determine which parts of the workflow need to get rerun. Whereas I will see timestamp changes every time, thus invalidating the cache when it's irrelevant. [1] https://dvc.org/ > Shouldn’t you better verify the data buffers themselves? Outside of > Python, the lzma utility > xzcmp for example allows binary comparison of the content of compressed > files independently > from the timestamp or even compression method, and it handles gzip and > bzip2 files as well. > We're often integrating with systems that only know how to compute hashes of files and don't try to dig into the semantics of the files any deeper. We're not looking to construct the most provably-correct cache, just make some pragmatic choices that will reduce false cache misses with systems that currently exist and enable *better*, if still imperfect, uses of those systems. > I would like to follow up with a pull request, but I am unable to > > find out how numpy.savetxt is invoking gzip. 
> > > > `savetxt` uses the abstractions in this module to invoke gzip when the > filename calls for it: > > > > https://github.com/numpy/numpy/blob/main/numpy/lib/_datasource.py#L115 > > > One obstacle would be that this is setting gzip.open as _file_opener, > which does not have the > mtime option; getting this replaced with gzip.GzipFile to work in text > mode would require > somehow replicating gzip.open’s io.TextIOWrapper wrapping of GzipFile. > That should be easily doable. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke wrote: > Regarding numpy, I'd propose a bolder measure: > To let savetxt(fname, X, ...) store exactly the same information in > compressed and uncompressed files, always invoke gzip with mtime = 0. > I agree. > I would like to follow up with a pull request, but I am unable to > find out how numpy.savetxt is invoking gzip. > `savetxt` uses the abstractions in this module to invoke gzip when the filename calls for it: https://github.com/numpy/numpy/blob/main/numpy/lib/_datasource.py#L115 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] How to get Boolean matrix for similar lists in two different-size numpy arrays of lists
On Sun, Mar 14, 2021 at 3:06 PM Ali Sheikholeslam < sheikholeslam@gmail.com> wrote: > I have written a question in: > > https://stackoverflow.com/questions/66623145/how-to-get-boolean-matrix-for-similar-lists-in-two-different-size-numpy-arrays-o > It was recommended by numpy to send this subject to the mailing lists. > > The question is as follows. I would appreciate it if you could advise me > on how to solve the problem: > > At first, I write a small example of two lists: > > F = [[1,2,3],[3,2,7],[4,4,1],[5,6,3],[1,3,7]] # (1*5) 5 lists > S = [[1,3,7],[6,8,1],[3,2,7]] # (1*3) 3 lists > > I want to get a Boolean matrix for the lists shared between F and S: > > [False, True, False, False, True] # (1*5) 5 Booleans > for 5 lists of F > > By using IM = reduce(np.in1d, (F, S)) it gives results for each number in > each list of F: > > [ True True True True True True False False True False True True > True True True] # (1*15) > > By using IM = reduce(np.isin, (F, S)) it gives results for each number in > each list of F, too, but in another shape: > > [[ True True True] > [ True True True] > [False False True] > [False True True] > [ True True True]] # (5*3) > > The true result will be achieved by the code IM = [i in S for i in F] for the > example lists, but when I'm using this code for my two main bigger numpy > arrays of lists: > > > https://drive.google.com/file/d/1YUUdqxRu__9-fhE1542xqei-rjB3HOxX/view?usp=sharing > > numpy array: 3036 lists > > > https://drive.google.com/file/d/1FrggAa-JoxxoRqRs8NVV_F69DdVdiq_m/view?usp=sharing > > numpy array: 300 lists > > It gives the wrong answer. For the main files it must give 3036 Booleans, of > which only 300 are 'True'. I don't understand why this gets wrong > answers? It seems it applied only to the 3rd element of each list of > F. It is preferred to use the reduce function with the two functions, np.in1d and > np.isin, instead of the last method. How could each of the three > above methods be fixed?? 
> Thank you for providing the data. Can you show a complete, runnable code sample that fails? There are several things that could go wrong here, and we can't be sure which is which without the exact code that you ran. In general, you may well have problems with the floating point data that you are not seeing with your integer examples. FWIW, I would continue to use something like the `IM = [i in S for i in F]` list comprehension for data of this size. You aren't getting any benefit trying to convert to arrays and using our array set operations. They are written for 1D arrays of numbers, not 2D arrays (attempting to treat them as 1D arrays of lists) and won't really work on your data. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
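With the toy lists from the question, the list comprehension gives the expected mask. For larger 2-D arrays, one straightforward alternative (a sketch, not a numpy routine) is to hash whole rows as tuples in a set, which stays row-wise and avoids the 1-D set operations entirely:

```python
import numpy as np

F = [[1, 2, 3], [3, 2, 7], [4, 4, 1], [5, 6, 3], [1, 3, 7]]
S = [[1, 3, 7], [6, 8, 1], [3, 2, 7]]

# the row-wise membership test from the thread:
IM = [i in S for i in F]
print(IM)    # -> [False, True, False, False, True]

# for large 2-D arrays, compare whole rows via a set of tuples:
S_set = {tuple(row) for row in np.asarray(S).tolist()}
IM2 = [tuple(row) in S_set for row in np.asarray(F).tolist()]
print(IM2 == IM)    # -> True
```

Note that with floating-point data, exact row equality is required for a row to count as a match, whichever method is used.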
Re: [Numpy-discussion] size of arrays
On Sat, Mar 13, 2021 at 4:18 PM wrote: > So is it right that 100 arrays of one element is smaller than one array > with size of 100 elements? > No, typically the opposite is true. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] size of arrays
On Sat, Mar 13, 2021 at 4:02 PM wrote: > Dear colleagues! > > Size of np.float16(1) is 26 > Size of np.float64(1) is 32 > 32 / 26 = 1.23 > Note that `sys.getsizeof()` is returning the size of the given Python object in bytes. `np.float16(1)` and `np.float64(1)` are so-called "numpy scalar objects" that wrap up the raw `float16` (2 bytes) and `float64` (8 bytes) values with the necessary information to make them Python objects. The extra 24 bytes for each is _not_ present for each value when you have `float16` and `float64` arrays of larger lengths. There is still some overhead to make the array of numbers into a Python object, but this does not increase with the number of array elements. This is what you are seeing below when you compute the sizes of the Python objects that are the arrays. The fixed overhead does not increase when you increase the sizes of the arrays. They eventually approach the ideal ratio of 4: `float64` values take up 4 times as many bytes as `float16` values, as the names suggest. The ratio of 1.23 that you get from comparing the scalar objects reflects that the overhead for making a single value into a Python object takes up significantly more memory than the actual single number itself. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
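The distinction is easy to check: per-element storage is reported by `itemsize` and `nbytes`, while `sys.getsizeof` includes the fixed Python-object overhead that dominates for scalars.

```python
import sys
import numpy as np

# raw storage per value: 2 bytes vs 8 bytes
print(np.float16(1).itemsize, np.float64(1).itemsize)     # 2 8

# getsizeof adds the Python-object overhead (exact numbers are
# platform-dependent, roughly the 26 and 32 quoted above)
print(sys.getsizeof(np.float16(1)), sys.getsizeof(np.float64(1)))

# for large arrays the overhead is fixed, so the ratio approaches 4
a16 = np.zeros(1_000_000, dtype=np.float16)
a64 = np.zeros(1_000_000, dtype=np.float64)
print(a64.nbytes / a16.nbytes)                            # -> 4.0
```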
Re: [Numpy-discussion] Using logfactorial instead of loggamma in random_poisson sampler
On Sat, Mar 6, 2021 at 1:45 PM Warren Weckesser wrote: > At the time, making that change was not a high priority, so I didn't > pursue it. It does make sense to use the logfactorial function there, > and I'd be happy to see it updated, but be aware that making the > change is more work than changing just the function call. > Does it make a big difference? Per NEP 19, even in `Generator`, we do weigh the cost of changing the stream reasonably highly. Improved accuracy is likely worthwhile, but a minor performance improvement is probably not. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 3:42 PM Ralf Gommers wrote: > > On Fri, Feb 12, 2021 at 9:21 PM Robert Kern wrote: > >> On Fri, Feb 12, 2021 at 1:47 PM Ralf Gommers >> wrote: >> >>> >>> On Fri, Feb 12, 2021 at 7:25 PM Sebastian Berg < >>> sebast...@sipsolutions.net> wrote: >>> >>>> >>>> Right, my initial feeling it that without such context `atleast_3d` is >>>> pretty surprising. So I wonder if we can design `atleast_nd` in a way >>>> that it is explicit about this context. >>>> >>> >>> Agreed. I think such a use case is probably too specific to design a >>> single function for, at least in such a hardcoded way. >>> >> >> That might be an argument for not designing a new one (or at least not >> giving it such a name). Not sure it's a good argument for removing a >> long-standing one. >> > > I agree. I'm not sure deprecating is best. But introducing new > functionality where `nd(pos=3) != 3d` is also not great. > > At the very least, atleast_3d should be better documented. It also is > telling that Juan (a long-time scikit-image dev) doesn't like atleast_3d > and there's very little usage of it in scikit-image. > I'm fairly neutral on atleast_nd(). I think that for n=1 and n=2, you can derive The One Way to Do It from broadcasting semantics, but for n>=3, I'm not sure there's much value in trying to systematize it to a single convention. I think that once you get up to those dimensions, you start to want to have domain-specific semantics. I do agree that, in retrospect, atleast_3d() probably should have been named more specifically. It was of a piece with other conveniences like dstack() that did special things to support channel-last images (and implicitly treat 3D arrays as such). For example, for DL frameworks that assemble channeled images into minibatches (with different conventions like BHWC and BCHW), you'd want the n=4 behavior to do different things. I _think_ you'd just want to do those with different functions than a complicated set of arguments to one function. 
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 1:47 PM Ralf Gommers wrote: > > On Fri, Feb 12, 2021 at 7:25 PM Sebastian Berg > wrote: > >> On Fri, 2021-02-12 at 10:08 -0500, Robert Kern wrote: >> > On Fri, Feb 12, 2021 at 9:45 AM Joseph Fox-Rabinovitz < >> > jfoxrabinov...@gmail.com> wrote: >> > >> > > >> > > >> > > On Fri, Feb 12, 2021, 09:32 Robert Kern >> > > wrote: >> > > >> > > > On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser < >> > > > wieser.eric+nu...@gmail.com> >> > > > wrote: >> > > > >> > > > > > There might be some linear algebraic reason why those axis >> > > > > > positions >> > > > > make sense, but I’m not aware of it... >> > > > > >> > > > > My guess is that the historical motivation was to allow >> > > > > grayscale `(H, >> > > > > W)` images to be converted into `(H, W, 1)` images so that they >> > > > > can be >> > > > > broadcast against `(H, W, 3)` RGB images. >> > > > > >> > > > >> > > > Correct. If you do introduce atleast_nd(), I'm not sure why you'd >> > > > deprecate and remove the one existing function that *isn't* made >> > > > redundant >> > > > thereby. >> > > > >> > > >> > > `atleast_nd` handles the promotion of 2D to 3D correctly. The `pos` >> > > argument lets you tell it where to put the new axes. What's >> > > unintuitive to >> > > my is that the 1D case gets promoted to from shape `(x,)` to shape >> > > `(1, x, >> > > 1)`. It takes two calls to `atleast_nd` to replicate that behavior. >> > > >> > >> > When thinking about channeled images, the channel axis is not of the >> > same >> > kind as the H and W axes. Really, you tend to want to think about an >> > RGB >> > image as a (H, W) array of colors rather than an (H, W, 3) ndarray of >> > intensity values. As much as possible, you want to treat RGB images >> > similar >> > to (H, W)-shaped grayscale images. Let's say I want to make a >> > separable >> > filter to convolve with my image, that is, we have a 1D filter for >> > each of >> > the H and W axes, and they are repeated for each channel, if RGB. 
>> > Setting >> > up a separable filter for (H, W) grayscale is straightforward with >> > broadcasting semantics. I can use (ntaps,)-shaped vector for the W >> > axis and >> > (ntaps, 1)-shaped filter for the H axis. Now, when I go to the RGB >> > case, I >> > want the same thing. atleast_3d() adapts those correctly for the (H, >> > W, >> > nchannels) case. >> >> Right, my initial feeling it that without such context `atleast_3d` is >> pretty surprising. So I wonder if we can design `atleast_nd` in a way >> that it is explicit about this context. >> > > Agreed. I think such a use case is probably too specific to design a > single function for, at least in such a hardcoded way. > That might be an argument for not designing a new one (or at least not giving it such a name). Not sure it's a good argument for removing a long-standing one. Broadcasting is a very powerful convention that makes coding with arrays tolerable. It makes some choices (namely, prepending 1s to the shape) to make some common operations with mixed-dimension arrays work "by default". But it doesn't cover all of the desired operations conveniently. atleast_3d() bridges the gap to an important convention for a major use-case of arrays. There's also "channels first" and "channels last" versions of RGB images as > 3-D arrays, and "channels first" is the default in most deep learning > frameworks - so the choice atleast_3d makes is a little outdated by now. > DL frameworks do not constitute the majority of image processing code, which has a very strong channels-last contingent. But nonetheless, the very popular Tensorflow defaults to channels-last. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 9:45 AM Joseph Fox-Rabinovitz < jfoxrabinov...@gmail.com> wrote: > > > On Fri, Feb 12, 2021, 09:32 Robert Kern wrote: > >> On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser >> wrote: >> >>> > There might be some linear algebraic reason why those axis positions >>> make sense, but I’m not aware of it... >>> >>> My guess is that the historical motivation was to allow grayscale `(H, >>> W)` images to be converted into `(H, W, 1)` images so that they can be >>> broadcast against `(H, W, 3)` RGB images. >>> >> >> Correct. If you do introduce atleast_nd(), I'm not sure why you'd >> deprecate and remove the one existing function that *isn't* made redundant >> thereby. >> > > `atleast_nd` handles the promotion of 2D to 3D correctly. The `pos` > argument lets you tell it where to put the new axes. What's unintuitive to > me is that the 1D case gets promoted from shape `(x,)` to shape `(1, x, > 1)`. It takes two calls to `atleast_nd` to replicate that behavior. > When thinking about channeled images, the channel axis is not of the same kind as the H and W axes. Really, you tend to want to think about an RGB image as a (H, W) array of colors rather than an (H, W, 3) ndarray of intensity values. As much as possible, you want to treat RGB images similar to (H, W)-shaped grayscale images. Let's say I want to make a separable filter to convolve with my image, that is, we have a 1D filter for each of the H and W axes, and they are repeated for each channel, if RGB. Setting up a separable filter for (H, W) grayscale is straightforward with broadcasting semantics. I can use (ntaps,)-shaped vector for the W axis and (ntaps, 1)-shaped filter for the H axis. Now, when I go to the RGB case, I want the same thing. atleast_3d() adapts those correctly for the (H, W, nchannels) case. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
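Concretely, the shapes `atleast_3d` produces are the separable-filter shapes for an (H, W, nchannels) image: the trailing length-1 axis lines up against the channel axis, so in an N-D convolution (e.g. with `scipy.ndimage.convolve`, to pick one assumed tool) each channel is filtered independently.

```python
import numpy as np

# grayscale separable-filter pieces:
w_taps = np.ones(5) / 5            # (5,)  : 1-D filter along W
h_taps = w_taps[:, np.newaxis]     # (5, 1): same filter, oriented along H

# atleast_3d adapts both for an (H, W, nchannels) image, adding
# length-1 axes so channels (and the other image axis) stay unmixed:
print(np.atleast_3d(w_taps).shape)   # -> (1, 5, 1)
print(np.atleast_3d(h_taps).shape)   # -> (5, 1, 1)
```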
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser wrote: > > There might be some linear algebraic reason why those axis positions > make sense, but I’m not aware of it... > > My guess is that the historical motivation was to allow grayscale `(H, W)` > images to be converted into `(H, W, 1)` images so that they can be > broadcast against `(H, W, 3)` RGB images. > Correct. If you do introduce atleast_nd(), I'm not sure why you'd deprecate and remove the one existing function that *isn't* made redundant thereby. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 12:10 PM Sebastian Berg wrote: > > This type of change should be in the release notes undoubtedly and > likely a `.. versionchanged::` directive in the docstring. > > Maybe the best thing would be to create a single, prominent but brief, > changelog listing all (or almost all) stream changes? (Possibly instead > of documenting it in the individual function as `.. versionchanged::`) > > I am thinking just a table with: > * version changed > * short description > * affected functions > * how the stream changed (if that is ever relevant) > Both are probably useful. Good ideas. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 11:38 AM Kevin Sheppard wrote: > That is good news indeed. Seems like a good upgrade, especially given the > breadth of application of normals and the multiple appearances within > distributions.c (e.g., Cauchy). Is there a deprecation for a change like > this? Or is it just a note and new random numbers in the next major? I > think this is the first time a substantially new algo has replaced an > existing one. > Per NEP 19, a change like this is a new feature that can be included in a feature release, like any other feature. I would like to see some more testing of the quality of the sequences beyond the tests already quoted, using Kolmogorov-Smirnov and/or Anderson-Darling tests as well, which should be more thorough. https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L604-L620 There are also some subtle issues involved in ziggurat method implementations that go beyond just the marginal distributions. I'm not sure, even, that our current implementation deals with the issues raised in the following paper, but I'd like to do no worse. https://www.doornik.com/research/ziggurat.pdf -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 10:53 AM Kevin Sheppard wrote: > My reading is that the first 4 are pure C, presumably using the standard > practice of inlining so as to make the tightest loop possible, and to > allow the compiler to make other optimizations. The final line is what > happens when you replace the existing ziggurat in NumPy with the new one. I > read it this way since it has both “new” and “old” with numpy. If it isn’t > this, then I’m unsure what “new” and “old” could mean in the context of > this thread. > No, these are our benchmarks of `Generator`. `numpy` is testing `RandomState`, which wasn't touched by their contribution. https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_random.py#L93-L97 https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_random.py#L123-L124 > I suppose camel-cdr can clarify what was actually done. > But I did run the built-in benchmark: ./runtests.py --bench > bench_random.RNG.time_normal_zig and the results are: > > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 3:05 AM Kevin Sheppard wrote: > If I understand correctly, there is no gain when applying this patch to > Generator. I'm not that surprised that this is the case since the compiler > is much more limited in what it can do in Generator than what it can when > filling a large array directly with functions available for inlining and > unrolling. Again, if I understand correctly I think it will be difficult to > justify breaking the stream for a negligible gain in performance. > Can you explain your understanding of the benchmark results? To me, it looks like nearly a 2x improvement with the faster BitGenerators (our default PCG64 and SFC64). That may or may not be worth breaking the stream, but it's far from negligible. > But I did run the built-in benchmark: ./runtests.py --bench >> bench_random.RNG.time_normal_zig and the results are: >>
>>              new          old
>> PCG64        589±3μs      1.06±0.03ms
>> MT19937      985±4μs      1.44±0.01ms
>> Philox       981±30μs     1.39±0.01ms
>> SFC64        508±4μs      900±4μs
>> numpy        2.99±0.06ms  2.98±0.01ms  # no change for /dev/urandom
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Sat, Feb 6, 2021 at 7:27 AM wrote: > Well, I can tell you why it needs to be backward compatible! I use random > numbers fairly frequently, and to unit test them I set a specific seed and > then make sure I get the same answers. > > Hmm, I guess that makes sense. I tried to adjust my algorithms to do the > same thing with the same bit's as the original one, but I couldn't get it > to work. > To be clear, this is not our backwards compatibility policy for the methods that you have modified. Our policy is spelled out here: https://numpy.org/neps/nep-0019-rng-policy.html The TestRandomDist suite of tests were adapted from the older RandomState (which is indeed frozen and not allowed to change algorithms). It's a mix of correctness tests that are valid regardless of the precise algorithm (does this method reject invalid arguments? do degenerate arguments yield the correct constant value?) and actual "has this algorithm changed unexpectedly?" tests. The former are the most valuable, but the latter are useful for testing in cross-platform contexts. Compilers and different CPUs can do naughty things sometimes, and we want the cross-platform differences to be minimal. When you do change an algorithm implementation for Generator, as you have done, you are expected to do thorough tests (offline, not in the unit tests) that it is correctly sampling from the target probability distribution, then once satisfied, change the hard-coded values in TestRandomDist to match whatever you are generating. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
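A minimal sketch of the kind of fixed-seed check involved; the expected values here are compared between two generators rather than hard-coded, so the example stays valid across algorithm changes (a real TestRandomDist entry would instead compare against values generated offline).

```python
import numpy as np

def test_stream_reproducible():
    # the property every fixed-seed regression test relies on:
    # the same seed always yields the identical stream
    a = np.random.default_rng(12345).standard_normal(5)
    b = np.random.default_rng(12345).standard_normal(5)
    assert np.array_equal(a, b)

def test_degenerate_args():
    # algorithm-independent correctness: degenerate arguments
    # must yield the correct constant value
    rng = np.random.default_rng(0)
    assert (rng.normal(loc=3.0, scale=0.0, size=4) == 3.0).all()

test_stream_reproducible()
test_degenerate_args()
```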
Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 171, Issue 39
On Tue, Dec 29, 2020 at 5:35 PM Brian Soto wrote: > I'm still learning proper mailing list etiquette so I'm not sure if this > is where I should respond. > FWIW, if you want to reply to conversations regularly, it's more ergonomic all around to subscribe regularly and not use the digests. But if you only occasionally want to jump in, using the digest is fine. It helps everyone if you change the subject line back to that of the thread that you are replying to and trim the digest to just the email you are responding to. Thank you! I hope this is helpful. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Addition of new distributions: Polya-gamma
My view is that we will not add more non-uniform distribution (i.e. "named" statistical probability distributions like Polya-Gamma) methods to `Generator`. I think that we might add a couple more methods to handle some more fundamental issues (like sampling from the unit interval with control over whether each boundary is open or closed, maybe one more variation on shuffling) that help write randomized algorithms. Now that we have the C and Cython APIs which allow one to implement non-uniform distributions in other packages, we strongly encourage that. As I commented on the linked PR, `scipy.stats` would be a reasonable place for a Polya-Gamma sampling function, even if it's not feasible to implement an `rv_continuous` class for it. You have convinced me that the nature of the Polya-Gamma distribution warrants this. The only issue is that scipy still depends on a pre-`Generator` version of numpy. So I recommend implementing this function in your own package with an eye towards contributing it to scipy later. On Sun, Dec 27, 2020 at 6:05 AM Zolisa Bleki wrote: > Hi All, > > I would like to know if Numpy accepts addition of new distributions since > the implementation of the Generator interface. If so, what are the criteria > for a particular distribution to be accepted? The reason why I'm asking is > because I would like to propose adding the Polya-gamma distribution to > numpy, for the following reasons: > > 1) Polya-gamma random variables are commonly used as auxiliary variables > during data augmentation in Bayesian sampling algorithms, which have > widespread usage in Statistics and recently, Machine learning. > 2) Since this distribution is mostly useful for random sampling, it seems > appropriate to have it in numpy and not projects like scipy [1]. > 3) The only python/C++ implementation of the sampler available is licensed > under GPLv3 which I believe limits copying into packages that choose to use > a different license [2]. 
> 4) Numpy's random API makes adding the distribution painless. > > I have done preliminary work on this by implementing the distribution > sampler as described in [3]; see: > https://github.com/numpy/numpy/compare/master...zoj613:polyagamma . > There is a more efficient sampling algorithm described in a later paper > [4], but I chose not to start with that one unless I know it is worth > investing time in. > > I would appreciate your thoughts on this proposal. > > Regards, > Zolisa > > > Refs: > [1] https://github.com/scipy/scipy/issues/11009 > [2] https://github.com/slinderman/pypolyagamma > [3] https://arxiv.org/pdf/1205.0310v1.pdf > [4] https://arxiv.org/pdf/1405.0506.pdf > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
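For context on what a third-party Polya-Gamma package would provide: a naive sampler can be prototyped in pure Python from the truncated infinite-sum representation in Polson, Scott & Windle (ref [3] above). This is far slower and less accurate in the tails than the algorithms in the linked references; the function name and truncation level are illustrative, not from any existing package:

```python
import numpy as np

def polya_gamma_naive(rng, b=1.0, z=0.0, trunc=200):
    """Naive PG(b, z) draw via a truncated sum of Gamma variates:

        PG(b, z) ~ (1 / (2*pi**2)) * sum_k g_k / ((k - 1/2)**2 + z**2 / (4*pi**2)),

    with g_k ~ Gamma(b, 1). Only the first `trunc` terms are kept, so
    this is a rough sketch, not a production-quality sampler."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return np.sum(g / ((k - 0.5) ** 2 + z ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

rng = np.random.default_rng(42)
draws = [polya_gamma_naive(rng) for _ in range(20_000)]
print(np.mean(draws))  # E[PG(1, 0)] = 1/4
```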
Re: [Numpy-discussion] locking np.random.Generator in a cython nogil context?
On Thu, Dec 17, 2020 at 9:56 AM Evgeni Burovski wrote: > On Thu, Dec 17, 2020 at 1:01 PM Matti Picus wrote: > > > > > > On 12/17/20 11:47 AM, Evgeni Burovski wrote: > > > Just as a side note, this is not very prominent in the docs, and I'm > > > ready to volunteer to send a doc PR --- I'm only not sure which part > > > of the docs, and would appreciate a pointer. > > > > Maybe here > > > > > https://numpy.org/devdocs/reference/random/bit_generators/index.html#seeding-and-entropy > > > > which is here in the sources > > > > > https://github.com/numpy/numpy/blob/master/doc/source/reference/random/bit_generators/index.rst#seeding-and-entropy > > > > > > And/or in the SeedSequence docstring documentation > > > > > https://numpy.org/devdocs/reference/random/bit_generators/generated/numpy.random.SeedSequence.html#numpy.random.SeedSequence > > > > which is here in the sources > > > > > https://github.com/numpy/numpy/blob/master/numpy/random/bit_generator.pyx#L255 > > > Here's the PR, https://github.com/numpy/numpy/pull/18014 > > Two minor comments, both OT for the PR: > > 1. The recommendation to seed the generators from the OS --- I've been > bitten by exactly this once. That was a rather exotic combination of a > vendor RNG and a batch queueing system, and some of my runs did end up > with identical random streams. Given that the recommendation is what > it is, it probably means that experience is a singular point and it no > longer happens with modern generators. > I suspect the vendor RNG was rolling its own entropy using time. We use `secrets.getrandbits()`, which ultimately uses the best cryptographic entropy source available. And if there is no cryptographic entropy source available, I think we fail hard instead of falling back to less reliable things like time. I'm not entirely sure that's a feature, but it is safe! > 2. 
Robert's comment that `SeedSequence(..., spawn_key=(num,))` is not > equivalent to `SeedSequence(...).spawn(num)[num]` and that the former > is not recommended. I'm not questioning the recommendation, but then > __repr__ seems to suggest the equivalence: > I was saying that they were equivalent. That's precisely why it's not recommended: it's too easy to do both and get identical streams inadvertently. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
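The equivalence Robert describes can be checked directly, and it is exactly why hand-built `spawn_key`s are discouraged: mixing them with later `spawn()` calls can silently produce identical streams. A small sketch:

```python
from numpy.random import SeedSequence

root = SeedSequence(2020)
child_via_spawn = root.spawn(1)[0]
child_by_hand = SeedSequence(2020, spawn_key=(0,))

# Same internal state -> identical streams, which is the hazard.
assert (child_via_spawn.generate_state(4) == child_by_hand.generate_state(4)).all()
```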
Re: [Numpy-discussion] locking np.random.Generator in a cython nogil context?
On Mon, Dec 14, 2020 at 3:27 PM Evgeni Burovski wrote: > > > > I also think that the lock only matters for Multithreaded code not > Multiprocess. I believe the latter pickles and unpickles any Generator > object (and the underying BitGenerator) and so each process has its own > version. Note that when multiprocessing the recommended procedure is to > use spawn() to generate a sequence of BitGenerators and to use a distinct > BitGenerator in each process. If you do this then you are free from the > lock. > > Thanks. Just to confirm: does using SeedSequence spawn_key arg > generate distinct BitGenerators? As in > > cdef class Wrapper(): > def __init__(self, seed): > entropy, num = seed > py_gen = PCG64(SeedSequence(entropy, spawn_key=(spawn_key,))) > self.rng = > py_gen.capsule.PyCapsule_GetPointer(capsule, "BitGenerator")# <--- > this > > cdef Wrapper rng_0 = Wrapper(seed=(123, 0)) > cdef Wrapper rng_1 = Wrapper(seed=(123, 1)) > > And then,of these two objects, do they have distinct BitGenerators? > The code you wrote doesn't work (`spawn_key` is never assigned). I can guess what you meant to write, though, and yes, you would get distinct `BitGenerator`s. However, I do not recommend using `spawn_key` explicitly. The `SeedSequence.spawn()` method internally keeps track of how many children it has spawned and uses that to construct the `spawn_key`s for its subsequent children. If you play around with making your own `spawn_key`s, then the parent `SeedSequence(entropy)` might spawn identical `SeedSequence`s to the ones you constructed. If you don't want to use the `spawn()` API to construct the separate `SeedSequence`s but still want to incorporate some per-process information into the seeds (e.g. the 0 and 1 in your example), then note that a tuple of integers is a valid value for the `entropy` argument. You can have the first item be the same (i.e. per-run information) and the second item be a per-process ID or counter. 
cdef class Wrapper():
    def __init__(self, seed):
        py_gen = PCG64(SeedSequence(seed))
        self.rng = py_gen.capsule.PyCapsule_GetPointer(capsule, "BitGenerator")

cdef Wrapper rng_0 = Wrapper(seed=(123, 0))
cdef Wrapper rng_1 = Wrapper(seed=(123, 1))

-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
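Robert's suggestion — shared per-run entropy plus a per-process integer in the `entropy` tuple — looks like this in pure Python (the process IDs 0 and 1 are illustrative, matching the example above):

```python
import numpy as np
from numpy.random import Generator, PCG64, SeedSequence

run_entropy = 123  # per-run value shared by all processes

# Each process mixes its own ID into the entropy tuple.
streams = [Generator(PCG64(SeedSequence((run_entropy, pid)))) for pid in (0, 1)]

# Distinct entropy tuples -> distinct, statistically independent streams.
a = streams[0].standard_normal(4)
b = streams[1].standard_normal(4)
assert not np.array_equal(a, b)
```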
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Fri, Dec 11, 2020 at 1:12 PM Aaron Meurer wrote: > On Fri, Dec 11, 2020 at 1:47 AM Eric Wieser > wrote: > > > > > you might want to discuss this with us at the array API standard > > > https://github.com/data-apis/array-api (which is currently in RFC > > > stage). The spec uses bool as the name for the boolean dtype. > > > > I don't fully understand this argument - `np.bool` is already not the > boolean dtype. Either: > > The spec does deviate from what NumPy currently does in some places. > If we wanted to just copy NumPy exactly, there wouldn't be a need for > a specification. I wouldn't take that as a premise. Specifying a subset of the vast existing NumPy API would be a quite valuable specification in its own right. I find the motivation for deviation laid out in the Purpose and Scope <https://data-apis.github.io/array-api/latest/purpose_and_scope.html#introduction> section to be reasonably convincing that deviation might be needed *somewhere*. The question then is, is *this* deviation supporting that stated motivation, or is it taking the opportunity of a redesign to rationalize the names more to our current tastes? Given the mode of adopting the standard (a separate subpackage), that's a reasonable choice to make, but let's be clear about the motivation. I submit that keeping the name `bool_` does not make it any harder for other array APIs to adopt the standard. It's just that few people would design a new API with that name if they were designing a greenfield API. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Wed, Dec 9, 2020 at 4:08 PM Aaron Meurer wrote: > On Wed, Dec 9, 2020 at 9:41 AM Sebastian Berg > wrote: > > > > On Mon, 2020-12-07 at 14:18 -0700, Aaron Meurer wrote: > > > Regarding np.bool specifically, if you want to deprecate this, you > > > might want to discuss this with us at the array API standard > > > https://github.com/data-apis/array-api (which is currently in RFC > > > stage). The spec uses bool as the name for the boolean dtype. > > > > > > Would it make sense for NumPy to change np.bool to just be the > > > boolean > > > dtype object? Unlike int and float, there is no ambiguity with bool, > > > and NumPy clearly doesn't have any issues with shadowing builtin > > > names > > > in its namespace. > > > > We could keep the Python alias around (which for `dtype=` is the same > > as `np.bool_`). > > > > I am not sure I like the idea of immediately shadowing the builtin. > > That is a switch we can avoid flipping (without warning); `np.bool_` > > and `bool` are fairly different beasts? [1] > > NumPy already shadows a lot of builtins, in many cases, in ways that > are incompatible with existing ones. It's not something I would have > done personally, but it's been this way for a long time. > Sometimes, we had the functions first, before Python added them to the builtins (e.g. sum(), any(), all(), IIRC). I think max() and min() are the main ones that we added after Python did, and we explicitly exclude them from __all__ to avoid clobbering the builtins. Shadowing the types (bool, int, float) historically tended to be more problematic than those functions. The first releases of numpy _did_ have those as the scalar types. That empirically turned out to cause more problems for people than sum() or any(), so we renamed the scalar types to have the trailing underscore. We only left the shadowed names as aliases for the builtins because enough people still had `dtype=np.float` in their code that we didn't want to break. 
All that said, "from numpy import *" is less common these days. We have been pretty successful at getting people on board with the np campaign. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
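The renamed scalar types Robert mentions are still there with their trailing underscores, distinct from the builtins they once shadowed — a quick check:

```python
import numpy as np

# The numpy scalar types got trailing underscores so that
# "from numpy import *" stops clobbering the builtins.
assert np.bool_ is not bool
assert np.int_ is not int
assert np.float64 is not float
# ...though float64 still subclasses the Python float for interop.
assert isinstance(np.float64(1.0), float)
```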
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Sun, Dec 6, 2020 at 12:52 AM Stephan Hoyer wrote: > On Sat, Dec 5, 2020 at 9:24 PM Mark Harfouche > wrote: > >> If the answer is to deprecate >> >> np.int(1) == int(1) >> >> then one can add a warning to the __init__ of the np.int class, but >> continue to subclass the python int class. >> >> It just doesn’t seem worthwhile to to stop people from using dtype=np.int, >> which seem to read: >> >> “I want this to be a numpy integer, not necessarily a python integer”. >> > The problem is that there is assuredly code that inadvertently relies upon > this (mis)feature. > > If we change the behavior of np.int() to create np.int64() objects > instead of int() objects, it is likely to result in breaking some user > code. Even with a prior warning, this breakage may be surprising and very > hard to track down. In contrast, it's much safer to simply remove np.int > entirely, because if users ignore the deprecation they end up with an error. > FWIW (and IIRC), *this* was the original misfeature. `np.int`, `np.bool`, and `np.float` were aliases for their corresponding default scalar types in the first numpy releases. However, too many people were doing `from numpy import *` and covering up the builtins. We renamed these aliases with trailing underscores to avoid that problem, but too many people (even in those early days) still had uses of `dtype=np.int`. Making `np.int is int` was the backwards-compatibility hack. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] New package to speed up ufunc inner loops
On Wed, Nov 4, 2020 at 7:22 PM Aaron Meurer wrote: > > > That's not to say that there isn't clearer language that could be > drafted. The NEP is still in Draft stage. But if you think it could be > clearer, please propose specific edits to the draft. Like with unclear > documentation, it's the person who finds the current docs > insufficient/confusing/unclear that is in the best position to recommend > the language that would have helped them. Collaboration helps. > > I disagree. The best person to write documentation is the person who > actually understands the package. I already noted that I don't > actually understand the actual situation with the trademark, for > instance. > Rather, I meant that the best person to fix confusing language is the person who was confused, after consulting with the authors/experts to come to a consensus about what was intended. > I don't really understand why there is pushback for making NEP > clearer. Also "like with unclear documentation", if someone says that > documentation is unclear, you should take their word for it that it > actually is, and improve it, rather than somehow trying to argue that > they actually aren't confused. > I'm not. I'm saying that I don't know how to make it more clear to those people because I'm not experiencing it like they are. The things I could think to add are the same kinds of things that were already stated explicitly in the Abstract, Motivation, and Scope. It seems like Stefan is in the same boat. Authors need editors, but the editor can't just say "rewrite!" I don't know what kind of assumptions and context this hypothetical reader is bringing to this reading that are leading to confusion. Sometimes it's clear, but not for me, here (and more relevantly, Stefan). Do you think this needs a complete revamp? Or just an additional sentence to explicitly state that this does not add additional legal restrictions to the copyright license? 
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion