[Numpy-discussion] Re: Adding bfill() to numpy.
On Mon, May 20, 2024 at 1:17 PM george trojan wrote: > xarray has both bfill and ffill for DataArrays. The implementation uses > bottleneck https://github.com/pydata/bottleneck. > bottleneck would probably be a good home for these functions (though I don't speak for its maintainers). If it gets picked up by a bunch of other array implementations, then you can make a proposal to have them added to the Array API standard. numpy probably won't add them unless that happens first. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
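For readers landing on this thread: absent bottleneck, the usual pure-NumPy versions of these can be sketched as follows. The names mirror xarray's, NaN is assumed as the missing-value marker, and 1-D input is assumed for brevity:

```python
import numpy as np

def ffill(a):
    # Forward-fill: carry the last non-NaN value forward.
    a = np.asarray(a, dtype=float)
    mask = np.isnan(a)
    # Index of the most recent non-NaN position at each element.
    idx = np.where(~mask, np.arange(len(a)), 0)
    np.maximum.accumulate(idx, out=idx)
    return a[idx]

def bfill(a):
    # Backward-fill is just a forward-fill on the reversed array.
    return ffill(np.asarray(a)[::-1])[::-1]
```

Leading NaNs survive `ffill` (and trailing NaNs survive `bfill`), matching the usual semantics.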
[Numpy-discussion] Re: dtype=object arrays not treated like Python list?
On Fri, Mar 29, 2024 at 9:11 AM Steven G. Johnson wrote: > Should a dtype=object array be treated more like Python lists for type > detection/coercion reasons? Currently, they are treated quite differently: > > >>> import numpy as np > >>> np.isfinite([1,2,3]) > array([ True, True, True]) > >>> np.isfinite(np.asarray([1,2,3], dtype=object)) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: ufunc 'isfinite' not supported for the input types, and the > inputs could not be safely coerced to any supported types according to the > casting rule ''safe'' > > The reason I ask is that we ran into something similar when trying to pass > wrappers around Julia arrays to Python. A Julia `Any[]` array is much like > a Python list or a Numpy `object` array, and exposed in Python as a subtype > of MutableSequence, but we found that it was treated by NumPy as more like > a Numpy `object` array than a Python list ( > https://github.com/JuliaPy/PythonCall.jl/issues/486). > > Would it be desirable to treat a 1d Numpy `object` array more like a > Python `list`? Or is there a way for externally defined types to opt-in > to the `list` behavior? (I couldn't figure out in the numpy source code > where `list` is being special-cased?) > `list` isn't special-cased, per se. Most numpy functions work on `ndarray` objects and accept "array-like" objects like `list`s by coercing them to `ndarray` objects ASAP using `np.asarray()`. `asarray` will leave existing `ndarray` instances alone. When you pass `[1,2,3]` to a numpy function, `np.asarray([1,2,3])` takes that list and infers the dtype and shape from its contents using just the `Sequence` API. Since they are `int` objects of reasonable size, the result is `np.array([1, 2, 3], dtype=np.int64)`. The `isfinite` ufunc has a loop defined for `dtype=np.int64`.
`np.asarray()` will also check for special methods and properties like `__array__()` and `__array_interface__` to allow objects to customize how they should be interpreted as `ndarray` objects, in particular, allowing memory to be shared efficiently as pointers, if the layouts are compatible, but also to control the dtype of the final array. PythonCall.jl implements `__array__()` and `__array_interface__` for its array objects for these purposes, and if I am not mistaken, explicitly makes `Any[]` convert to a `dtype=object` `ndarray` <https://github.com/JuliaPy/PythonCall.jl/blob/f586f2494432f2e5366ff1e2876b8aa532630b54/src/JlWrap/objectarray.jl#L61-L69> (the `typestr = "|O"` is equivalent to `dtype=object` in that context). It's *possible* that you'd really rather have this `PyObjectArray` *not* implement those interfaces and just let the `np.asarray()` inference work its magic through the `Sequence` API, but that is expensive. You could also try doing the type inference yourself and implement that in the `PyObjectArray.__array__()` implementation and avoid implementing `__array_interface__` for that object. Then `np.asarray()` will just delegate to `PyObjectArray.__array__()`. -- Robert Kern
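A hedged illustration of that last suggestion (the class name is hypothetical, not part of PythonCall.jl): a wrapper that implements `__array__()` but not `__array_interface__`, so `np.asarray()` delegates to it and dtype inference runs over the plain sequence instead of being forced to `object`:

```python
import numpy as np

class SeqWrapper:
    """Hypothetical wrapper: delegate dtype inference to np.asarray."""
    def __init__(self, items):
        self._items = list(items)

    def __array__(self, dtype=None):
        # No __array_interface__ defined, so np.asarray() calls this and
        # infers an appropriate dtype from the contents of the list.
        return np.asarray(self._items, dtype=dtype)

a = np.asarray(SeqWrapper([1, 2, 3]))
assert np.issubdtype(a.dtype, np.integer)  # inferred, not object
assert np.isfinite(a).all()                # the integer ufunc loop exists
```

As noted above, this inference pass walks every element, so it trades the `TypeError` for a real conversion cost on large arrays.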
[Numpy-discussion] Re: Polyfit error in displacement
On Tue, Mar 26, 2024 at 3:39 PM Luca Bertolotti wrote: > Thanks for your reply. Yes, a cubic spline seems more appropriate, but how > can I get what they call SC? > I don't think any of us has enough context to know what "SC" is. It's not a standard term that I'm aware of. -- Robert Kern
[Numpy-discussion] Re: Converting array to string
On Sun, Feb 25, 2024 at 1:52 PM Dom Grigonis wrote: > Thank you for your answer, > > Yeah, I was unsure if it ever existed in the first place. > > Space is less of an issue and something like `a.data.hex()` would be fine > as long as its speed was on par with `a.tobytes()`. However, it is 10x > slower on my machine. > > This e-mail is pretty much a final check (after a fair bit of research and > attempts) that it can not be done so I can eliminate this possibility as > feasible and concentrate on other options. > I think that mostly runs the gamut for pure JSON. Consider looking at BJData, but it's "JSON-based" and not quite pure JSON. https://neurojson.org/ -- Robert Kern
[Numpy-discussion] Re: Converting array to string
On Sat, Feb 24, 2024 at 7:17 PM Dom Grigonis wrote: > Hello, > > I am seeking a bit of help. > > I need a fast way to transfer numpy arrays in json format. > > Thus, converting array to list is not an option and I need something > similar to: > > a = np.ones(1000) > %timeit a.tobytes() > 17.6 ms > > This is quick, fast and portable. In other words I am very happy with > this, but... > > Json does not support bytes. > > Any method of subsequent conversion from bytes to string is a number of > times slower than the actual serialisation. > > So my question is: Is there any way to serialise directly to string? > > I remember there used to be 2 methods: tobytes and tostring. However, I > see that tostring is deprecated and its functionality is equivalent to > `tobytes`. Is it completely gone? Or is there a chance I can still find a > performant version of converting to and back from `str` type of non-human > readable form? > The old `tostring` was actually the same as `tobytes`. In Python 2, the `str` type was what `bytes` is now, a string of octets. In Python 3, `str` became a string of Unicode characters (what you want) and the `bytes` type was introduced for the string of octets, so `tostring` was merely _renamed_ to `tobytes` to match. `tostring` never returned a string of Unicode characters suitable for inclusion in JSON. AFAICT, there is not likely to be a much more efficient way to convert from an array to a reasonable JSONable encoding (e.g. base64). The time you are seeing is the time it takes to encode that amount of data, period. That said, if you want to use a quite inefficient hex encoding, `a.data.hex()` is somewhat faster than the base64 encoding, but it's less than ideal in terms of space usage. -- Robert Kern
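To make the base64 route concrete, a common pattern (a sketch of the trade-off discussed above, not an official recommendation) is to pair `tobytes()` with base64 and carry the dtype and shape alongside:

```python
import base64
import json
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

# Encode: raw bytes -> base64 text that JSON can carry.
payload = json.dumps({
    "dtype": str(a.dtype),
    "shape": list(a.shape),
    "data": base64.b64encode(a.tobytes()).decode("ascii"),
})

# Decode: reverse the steps.
d = json.loads(payload)
b = np.frombuffer(base64.b64decode(d["data"]),
                  dtype=d["dtype"]).reshape(d["shape"])
assert (a == b).all()
```

For cross-platform files you would want `a.dtype.str` (e.g. `'<f8'`) rather than `str(a.dtype)`, since the latter does not record byte order.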
[Numpy-discussion] Re: ENH: Introducing a pipe Method for Numpy arrays
On Thu, Feb 15, 2024 at 10:21 AM wrote: > Hello Numpy community, > > I'm proposing the introduction of a `pipe` method for NumPy arrays to > enhance their usability and expressiveness. > Adding a prominent method like this to `np.ndarray` is something that we will probably not take up ourselves unless it is adopted by the Array API standard <https://data-apis.org/array-api/latest/>. It's possible that you might get some interest there since the Array API deliberately strips out a number of the methods that we already have (e.g. `.mean()`, `.sum()`, etc.) in favor of functions. A general way to add some kind of fluency cheaply in an Array API-agnostic fashion might be helpful to people whose numpy-only code uses our current set of methods in this way. But you'll have to make the proposal to them, I think, to get started. -- Robert Kern
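For reference, the kind of fluency being proposed can be had today with a tiny array-agnostic helper function (a sketch of the idea, not an accepted API):

```python
import numpy as np

def pipe(x, *funcs):
    """Apply funcs left to right: pipe(x, f, g) == g(f(x))."""
    for f in funcs:
        x = f(x)
    return x

a = np.array([-3.0, 4.0])
# Instead of the inside-out np.sqrt(np.sum(np.square(a))):
norm = pipe(a, np.square, np.sum, np.sqrt)
assert norm == 5.0
```

Because it only calls functions, this works unchanged with any Array API-compatible library, which is roughly why a free function is an easier sell than a new `ndarray` method.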
[Numpy-discussion] Re: Feature request: Extension of the np.argsort Function - Returning Positional Information for Data
On Tue, Jan 16, 2024 at 11:05 PM hao chen wrote: > When dealing with lists that contain duplicate data, np.argsort fails to > return index values that correspond to the actual sorting positions of the > data, as it does when handling arrays without duplicates. > > Dear Author: > > When I use the np.argsort function on an array without duplicate data, the > returned index values correspond to the sorting positions of the respective > data. > > x = [1, 2, 5, 4] > rank = np.argsort(x) > print(rank) > # [0 1 3 2] > > That is not what `argsort` is intended or documented to do. It returns an array of indices _into `x`_ such that if you took the values from `x` in that order, you would get a sorted array. That is, if `x` were sorted into the array `sorted_x`, then `x[rank[i]] == sorted_x[i]` for all `i in range(len(x))`. The indices in `rank` are positions in `x`, not positions in `sorted_x`. They happen to correspond in this case, but that's a coincidence that's somewhat common in these small examples. But consider `[20, 30, 10, 40]`: >>> x = np.array([20, 30, 10, 40]) >>> ix = np.argsort(x) >>> def position(x): ... sorted_x = np.array(x) ... sorted_x.sort() ... return np.searchsorted(sorted_x, x) ... >>> ip = position(x) >>> ix array([2, 0, 1, 3]) >>> ip array([1, 2, 0, 3]) But also notice: >>> np.argsort(np.argsort(x)) array([1, 2, 0, 3]) This double-argsort is what you seem to be looking for, though it depends on what you want from the handling of duplicates (do you return the first index into the sorted array with the same value as in my `position()` implementation, or do you return the index that particular item was actually sorted to). Either way, we probably aren't going to add this as its own function. Both options are straightforward combinations of existing primitives. 
-- Robert Kern
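One addition to the double-argsort idiom above: the second sort can be replaced by a plain permutation inversion, which is O(n) once the first `argsort` is done. This assumes the "index the item actually sorted to" handling of duplicates:

```python
import numpy as np

x = np.array([20, 30, 10, 40])
ix = np.argsort(x)

# Invert the permutation: rank[i] is where x[i] landed in the sorted array.
rank = np.empty_like(ix)
rank[ix] = np.arange(len(x))

assert (rank == np.argsort(np.argsort(x))).all()
assert rank.tolist() == [1, 2, 0, 3]
```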
[Numpy-discussion] Re: Change definition of complex sign (and use it in copysign)
On Wed, Jan 3, 2024 at 4:09 PM Aaron Meurer wrote: > sign(z) = z/|z| is a fairly standard definition. See > https://oeis.org/wiki/Sign_function and > https://en.wikipedia.org/wiki/Sign_function. It's also implemented > this way in MATLAB and Mathematica (see > https://www.mathworks.com/help/symbolic/sign.html and > https://reference.wolfram.com/language/ref/Sign.html). The function > z/|z| is useful because it represents a normalization of z as a vector > in the complex plane onto the unit circle. > > With that being said, I'm not so sure about the suggestion about > extending copysign(x1, x2) as |x1|*sign(x2). I generally think of > copysign as a function to manipulate the floating-point representation > of a number. It literally copies the sign *bit* from x2 into x1. It's > useful because of things like -0.0, which is otherwise difficult to > work with since it compares equal to 0.0. I would find it surprising > for copysign to do a numeric calculation on complex numbers. Also, > your suggested definition would be wrong for 0.0 and -0.0, since > sign(0) is 0, and this is precisely where copysign matters. > Agreed on all points. -- Robert Kern
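For concreteness, the z/|z| definition under discussion (with sign(0) = 0) can be computed by hand today; this sketch does not assume any particular `np.sign` behavior, which differed across NumPy versions for complex input:

```python
import numpy as np

def complex_sign(z):
    """sign(z) = z/|z|, with sign(0) = 0 (the definition discussed above)."""
    z = np.asarray(z, dtype=complex)
    mag = np.abs(z)
    # Guard the 0/0 case; np.where selects 0 for the zero entries anyway.
    return np.where(mag == 0, 0, z / np.where(mag == 0, 1, mag))

s = complex_sign([3 + 4j, -2j, 0])
assert np.allclose(s, [0.6 + 0.8j, -1j, 0])
assert np.allclose(np.abs(s[:2]), 1)  # nonzero results lie on the unit circle
```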
[Numpy-discussion] Re: savetxt() has fmt='%.18e' as default, but fmt='%.16e' is always sufficient
On Sat, Nov 25, 2023 at 11:18 AM Jeppe Dakin wrote: > Double-precision numbers need at most 17 significant decimal digits to be > serialised losslessly. Yet, savetxt() uses 19 by default, meaning that most > files produced with savetxt() take up about 9% more disk space than they > need to, without any benefit. I have described the problem in more detail on > Stack Overflow: > > https://stackoverflow.com/questions/77535380/minimum-number-of-digits-for-exact-double-precision-and-the-fmt-18e-of-numpy > > Is there any reason behind the default choice of savetxt(..., > fmt='%.18e')? If not, why not reduce it to savetxt(..., fmt='%.16e')? > A long time ago when `savetxt()` was written, we did not use the reliable Dragon4 string representation algorithm that guarantees that floating point numbers are written out with the minimum number of decimal digits needed to reproduce the number. We may even have relied on the platform's floating point to string conversion routines, which were of variable quality. The extra digits accounted for that unreliability. It probably could be changed now, but I'd want more aggressive testing of the assertion of correctness (`random()`, as used in that StackOverflow demonstration, does *not* exercise a lot of the important edge cases in the floating point format). But if your true concern is that 9% of disk space, you probably don't want to be using `savetxt()` in any case. -- Robert Kern
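The 17-significant-digit claim ('%.16e' prints one digit before the decimal point plus 16 after) can be spot-checked along these lines, with the caveat from the reply above that random draws plus a few hand-picked constants are nowhere near an exhaustive test of the edge cases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random values plus some awkward constants.
x = np.concatenate([rng.random(1000),
                    [0.1, 1 / 3, np.pi, np.nextafter(1.0, 2.0)]])

for fmt in ("%.16e", "%.17e", "%.18e"):
    # Format each value as text, parse it back, and demand exact equality.
    y = np.array([float(fmt % v) for v in x])
    assert (x == y).all()
```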
[Numpy-discussion] Re: pickling dtype values
On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet wrote: > hi there, > > I have written a Go package[1] that can read/write simple arrays in the > numpy file format [2]. > when I wrote it, it was for simple interoperability use cases, but now > people would like to be able to read back ragged-arrays[3]. > > unless I am mistaken, this means I need to interpret pieces of pickled > data (`ndarray`, `multiarray` and `dtype`). > > so I am trying to understand how to unpickle `dtype` values that have been > pickled: > > ```python > import numpy as np > import pickle > import pickletools as pt > > pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True) > ``` > > gives: > ``` > 0: \x80 PROTO 4 Protocol version indicator. > 2: \x95 FRAME 55 Indicate the beginning of a new frame. >11: \x8c SHORT_BINUNICODE 'numpy' Push a Python Unicode string object. >18: \x94 MEMOIZE(as 0)Store the stack top into the memo. > The stack is not popped. >19: \x8c SHORT_BINUNICODE 'dtype' Push a Python Unicode string object. >26: \x94 MEMOIZE(as 1)Store the stack top into the memo. > The stack is not popped. >27: \x93 STACK_GLOBAL Push a global object (module.attr) on > the stack. >28: \x94 MEMOIZE(as 2)Store the stack top into the memo. > The stack is not popped. >29: \x8c SHORT_BINUNICODE 'i4'Push a Python Unicode string object. >33: \x94 MEMOIZE(as 3)Store the stack top into the memo. > The stack is not popped. >34: \x89 NEWFALSE Push False onto the stack. >35: \x88 NEWTRUE Push True onto the stack. >36: \x87 TUPLE3 Build a three-tuple out of the top > three items on the stack. >37: \x94 MEMOIZE(as 4)Store the stack top into the memo. > The stack is not popped. >38: RREDUCE Push an object built from a callable > and an argument tuple. >39: \x94 MEMOIZE(as 5)Store the stack top into the memo. > The stack is not popped. >40: (MARK Push markobject onto the stack. >41: KBININT13 Push a one-byte unsigned integer. >43: \x8c SHORT_BINUNICODE '<' Push a Python Unicode string object. 
>46: \x94 MEMOIZE(as 6)Store the stack top into the memo. > The stack is not popped. >47: NNONE Push None on the stack. >48: NNONE Push None on the stack. >49: NNONE Push None on the stack. >50: JBININT -1Push a four-byte signed integer. >55: JBININT -1Push a four-byte signed integer. >60: KBININT10 Push a one-byte unsigned integer. >62: tTUPLE (MARK at 40) Build a tuple out of the topmost > stack slice, after markobject. >63: \x94 MEMOIZE(as 7) Store the stack top into the > memo. The stack is not popped. >64: bBUILD Finish building an object, via > __setstate__ or dict update. >65: .STOPStop the unpickling machine. > highest protocol among opcodes = 4 > ``` > > I have tried to find the usual `__reduce__` and `__setstate__` methods to > understand what are the various arguments, to no avail. > First, be sure to read the generic `object.__reduce__` docs: https://docs.python.org/3.11/library/pickle.html#object.__reduce__ Here is the C source for `np.dtype.__reduce__()`: https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750 And `np.dtype.__setstate__()`: https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151 so, in : > ```python > >>> np.dtype("int32").__reduce__()[1] > ('i4', False, True) > These are arguments to the `np.dtype` constructor and are documented in `np.dtype.__doc__`. The `False, True` arguments are hardcoded and always those values. > >>> np.dtype("int32").__reduce__()[2] > (3, '<', None, None, None, -1, -1, 0) > These are arguments to pass to `np.dtype.__setstate__()` after the object has been created. 0. `3` is the version number of the state; `3` is typical for simple dtypes; datetimes and others with metadata will bump this to `4` and use a 9-element tuple instead of this 8-element tuple. 1. `'<'` is the endianness flag. 2. If there are subarrays <https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g. `np.dtype((np.int32, (2,2)))`), that info here. 3. 
If there are fields, a tuple of the names of the fields 4. If there are fields, the
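Putting the pieces above together, the whole pickle round-trip can be reproduced by hand, which may help when reimplementing it in Go; the `('i4', False, True)` args and the 8-element state tuple are taken from the example in this thread:

```python
import numpy as np

dt = np.dtype("int32")
ctor, args, state = dt.__reduce__()

# args are the constructor arguments shown above.
assert args == ("i4", False, True)
assert state[0] == 3      # state version for simple dtypes
assert state[1] == "<"    # endianness flag

dt2 = ctor(*args)         # what the pickle REDUCE opcode does
dt2.__setstate__(state)   # what the pickle BUILD opcode does
assert dt2 == dt
```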
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 7:11 PM Aaron Meurer wrote: > rng.integers() (or np.random.randint) lets you specify lists for low > and high. So you can just use rng.integers((0,)*len(dims), dims). > > Although I'm not seeing how to use this to generate a bunch of vectors > at once. I would have thought something like size=(10, dims) would let > you generate 10 vectors of length dims but it doesn't seem to work. > `size=(k, len(dims))`

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        dsize = size - len(seen)
        seen.update(map(tuple, rng.integers(0, ashape, size=(dsize, len(shape)))))
    return list(seen)

That optimistic optimization makes this the fastest solution. -- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 5:34 PM Stefan van der Walt wrote: > On Fri, Nov 17, 2023, at 14:28, Stefan van der Walt wrote: > > Attached is a script that implements this solution. > > > And the version with set duplicates checking. > If you're going to do the set-checking yourself, then you don't need the unbounded integer support. This is somewhat slower, probably due to the greedy tuple conversions, but eliminates all of the complexity of subclassing `Random`:

def sample_indices(shape, size, rng=None):
    rng = np.random.default_rng(rng)
    ashape = np.array(shape)
    seen = set()
    while len(seen) < size:
        idx = tuple(rng.integers(0, ashape))
        seen.add(idx)
    return list(seen)

Unfortunately, subclassing from `Random` still doesn't get you the ability to sample without replacement for arbitrary-sized populations using `Random`'s own methods for that. `Random.sample(population, k)` requires `population` to be a `Sequence`, and that restricts you to index-sized integers (`ssize_t`) (you can try to fake one, but `len()` will balk if it gets a too-large integer from `__len__`). But `Random.sample` is only just doing set-checking anyways, so no loss. -- Robert Kern
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 4:15 PM Aaron Meurer wrote: > On Fri, Nov 17, 2023 at 12:10 PM Robert Kern > wrote: > > > > If the arrays you are drawing indices for are real in-memory arrays for > present-day 64-bit computers, this should be adequate. If it's a notional > array that is larger, then you'll need actual arbitrary-sized integer > sampling. The builtin `random.randrange()` will do arbitrary-sized integers > and is quite reasonable for this task. If you want it to use our > BitGenerators underneath for clean PRNG state management, this is quite > doable with a simple subclass of `random.Random`: > https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 > > Wouldn't it be better to just use random.randint to generate the > vectors directly at that point? If the number of possibilities is more > than 2**64, birthday odds of generating the same vector twice are on > the order of 1 in 2**32. And you can always do a unique() rejection > check if you really want to be careful. > Yes, I jumped to correcting the misreading of the docstring rather than solving the root problem. Almost certainly, the most straightforward strategy is to use `rng.integers(0, array_shape)` repeatedly, storing tuples of the resulting indices in a set and rejecting duplicates. You'd have to do the same thing if one used `rng.integers(0, np.prod(array_shape))` or `random.randint(0, np.prod(array_shape))` in the flattened case. `rng.integers()` doesn't sample without replacement in any case. `rng.choice()` does. It also will not support unbounded integers; it uses a signed `int64` to hold the population size, so it is bounded to that. That is _likely_ to be a fine upper bound on the size of the array (if it is a real array in a 64-bit address space). So one could use `rng.choice(np.prod(array_shape), replace=False, size=n_samples)` and unflatten these integers to the index tuples if that suits the array size. 
It's just doing the same set lookups internally, just somewhat more efficiently. -- Robert Kern
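A sketch of that last approach: draw distinct flat indices with `rng.choice(..., replace=False)` and unflatten them back into index tuples with `np.unravel_index`:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (5, 7, 3)
k = 10

# Sample k distinct flat indices, then unflatten to N-d indices.
flat = rng.choice(np.prod(shape), size=k, replace=False)
idx = np.stack(np.unravel_index(flat, shape), axis=-1)

assert idx.shape == (k, len(shape))
assert len({tuple(row) for row in idx}) == k   # all rows unique
assert (idx >= 0).all() and (idx < np.array(shape)).all()
```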
[Numpy-discussion] Re: How to sample unique vectors
On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion < numpy-discussion@python.org> wrote: > Hi all, > > I am trying to sample k N-dimensional vectors from a uniform distribution > without replacement. > It seems like this should be straightforward, but I can't seem to pin it > down. > > Specifically, I am trying to get random indices in an d0 x d1 x d2.. x > dN-1 array. > > I thought about sneaking in a structured dtype into `rng.integers`, but of > course that doesn't work. > > If we had a string sampler, I could sample k unique words (consisting of > digits), and convert them to indices. > > I could over-sample and filter out the non-unique indices. Or iteratively > draw blocks of samples until I've built up my k unique indices. > > The most straightforward solution would be to flatten indices, and to > sample from those. The integers get large quickly, though. The rng.integers > docstring suggests that it can handle object arrays for very large integers: > > > When using broadcasting with uint64 dtypes, the maximum value (2**64) > > cannot be represented as a standard integer type. > > The high array (or low if high is None) must have object dtype, e.g., > array([2**64]). > > But, that doesn't work: > > In [35]: rng.integers(np.array([2**64], dtype=object)) > ValueError: high is out of bounds for int64 > > Is there an elegant way to handle this problem? > The default dtype for the result of `integers()` is the signed `int64`. If you want to sample from the range `[0, 2**64)`, you need to specify `dtype=np.uint64`. The text you are reading is saying that if you want to specify exactly `2**64` as the exclusive upper bound that you won't be able to do it with a `np.uint64` array/scalar because it's one above the bound for that dtype, so you'll have to use a plain Python `int` object or `dtype=object` array in order to represent `2**64`. It is not saying that you can draw arbitrary-sized integers. 
>>> rng.integers(2**64, dtype=np.uint64) 11569248114014186612 If the arrays you are drawing indices for are real in-memory arrays for present-day 64-bit computers, this should be adequate. If it's a notional array that is larger, then you'll need actual arbitrary-sized integer sampling. The builtin `random.randrange()` will do arbitrary-sized integers and is quite reasonable for this task. If you want it to use our BitGenerators underneath for clean PRNG state management, this is quite doable with a simple subclass of `random.Random`: https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern
[Numpy-discussion] Re: Adding NumpyUnpickler to Numpy 1.26 and future Numpy 2.0
On Wed, Oct 11, 2023 at 11:50 PM Aaron Meurer wrote: > Is there a way to make pickle not depend on the specific submodule > that a class is defined in? No. `Unpickler` somehow has to locate the class/reconstruction function. It will have to use the name given by the `__reduce_ex__` during pickling. > Wouldn't this happen again if you ever > decided to rename _core. > Yes. > The underscores in numpy._core._reconstruct don't actually do anything > here in terms of making the interface not public, and if anything, are > really misleading. > There's private and there's private. There are two broad things that could mean: 1. We don't want users playing around with it, importing it directly in their code and using it. 2. Marking that we won't mess around with it without going through the deprecation policy. Usually, these two go together (hiding it from users directly calling it means that we get to mess around with it freely), but when we get to metaprogramming tasks like pickling, they become a little decoupled. Because the "user" that we have to think about isn't a person reading documentation and browsing APIs, but a file that someone created years ago and some infrastructure that interprets that file. I think it's good that `_reconstruct` is hidden from human users and is distinct from almost all of what constitutes "numpy's public API", but that doesn't mean that we can't have a tiny third class of functions where we do preserve some guarantees. It's only occasionally confusing to us core devs, not users (who can rely on the "don't touch underscored names" rule just fine). I'm also curious about this more generally. We tend to think of the > fully qualified name of a class as being an implementation detail for > many libraries and only the top-level lib.Name should be used, but > many things in the language (including this) break this. > Yes, and it's why these things attempt to be scoped in ways that limit this problem. I.e. 
if you use pickling, you're told to use it only for transient data with the same versions of libraries on both ends of the pipe, but the reality is that it's too useful to avoid in creating files with arbitrarily long lives. Not their fault; they warned us! -- Robert Kern
[Numpy-discussion] Re: Find location of slice in it's base
On Thu, Aug 31, 2023 at 3:25 PM Dom Grigonis wrote: > Hi everyone, > > I am working with shared arrays and their slices. And trying to subclass > ndarray in such way so that they remap to memory on unpickling. I am using > package SharedArray, which doesn’t micro-manage memory locations, but > rather stores the whole map via shm using unique name. One issue I ran into > is that when I pickle & unpickle slice of memory mapped array, I need to > store the spec of the subset so that I can retrieve the same portion. > > The issue I am working on has been briefly discussed on stack: > https://stackoverflow.com/questions/12421770/find-location-of-slice-in-numpy-array. > Although I am not sure if it is the exact same one. > > So, in short, is there a way to determine location of a slice in original > array. Ideally, it would be indexing which works for any non-copy > slice/subset possible. > If you chase the `.base` chain all the way down, you get to a `SharedArray.map_owner` object with its `.addr` attribute giving the memory address that's assigned to it in the current process. This will be the same address that's in the `'data'` key of that first `ndarray` returned from `SharedArray.create()` that you are making your views from. Your view `ndarray` may be a view of views, with slices, transposes, reshapes, and arbitrary `stride_tricks` manipulations in between. If you store the difference between the view `ndarray`'s `'data'` address and the `map_owner.addr` address, I think that's all you need to recreate the array in a different process which gets assigned a different memory address for the same `shm` name. Just add that offset to the `map_owner.addr` and restore the rest of the information in `__array_interface__`, and I think you should be good to go. I don't think you'll need to infer or represent the precise path of Python-level operations (slices, transposes, reshapes, etc.) by which it got to that point.
-- Robert Kern
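The bookkeeping can be demonstrated with plain in-process views, substituting an ordinary `ndarray` base for the `map_owner` object (`view_spec` is a hypothetical helper for illustration, not part of SharedArray):

```python
import numpy as np

def view_spec(view, base):
    """Hypothetical: record a view's layout relative to its base allocation."""
    base_addr = base.__array_interface__["data"][0]
    addr = view.__array_interface__["data"][0]
    return {
        "offset": addr - base_addr,   # bytes past the base address
        "shape": view.shape,
        "strides": view.strides,
        "typestr": view.dtype.str,
    }

a = np.zeros((4, 6))     # stands in for the shared base array
v = a[1:, ::2]           # an arbitrary non-copy view
spec = view_spec(v, a)

assert spec["offset"] == 6 * 8        # skipped one row of 6 float64s
assert spec["strides"] == (48, 16)    # row stride unchanged, column doubled
```

In another process, adding `spec["offset"]` to the new base address and feeding the shape/strides/typestr back through the array interface (e.g. `np.lib.stride_tricks.as_strided` on a suitably offset buffer) reconstructs the same view.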
[Numpy-discussion] Re: Inquiry about the difference between numpy.trunc() and numpy.fix()
On Wed, Aug 23, 2023 at 10:06 PM Bao Guo wrote: > I am writing to seek clarification on the specific differences between the > numpy.trunc() and numpy.fix() functions. > > I have come across opinions stating that these two functions exist for > backward compatibility and that they serve different functional > requirements. However, I am puzzled as to why these functions could not be > merged or one of them upgraded to encompass both functionalities. Despite > my efforts in researching and contemplating on this matter, I have not come > across a convincing explanation. > > Could anyone kindly shed some light on this matter? I am seeking a deeper > understanding of the design choices made for these functions and the > rationale behind their coexistence. > > Thank you in advance for your time and assistance. `fix()` was added to numpy back in the super-early days when it was emerging from Numeric, its earlier predecessor. It was named after the equivalent MATLAB function. At some point thereafter, we settled on a policy for inclusion of mathematical ufuncs into numpy. We got tired of arguing whether a special function was "important enough" to be in numpy instead of scipy.special. Eventually, we settled on a bright line that if a math function was in the C99 standard, we'd go ahead and allow a ufunc for it in numpy. `trunc()` is the name of that equivalent function in the C99 standard. It is a ufunc instead of a plain function like `fix()` and has a (strict, I believe) superset of functionality. You can ignore `fix()`, more or less. I'm not sure if it's on the list for deprecation/removal in numpy 2.0, but it certainly could be. -- Robert Kern
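The practical difference is visible in a few lines: the numerical results agree, but only `trunc` is a true ufunc with the full ufunc machinery:

```python
import numpy as np

x = np.array([-1.7, -0.2, 0.2, 1.7])

# Both round toward zero and give identical values.
assert (np.trunc(x) == np.fix(x)).all()

# Only trunc is a ufunc, so it carries out=, where=, .reduce, etc.
assert isinstance(np.trunc, np.ufunc)
assert not isinstance(np.fix, np.ufunc)
```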
[Numpy-discussion] Re: Arbitrarily large random integers
On Sat, Aug 19, 2023 at 10:49 AM Kevin Sheppard wrote: > The easiest way to do this would be to write a pure python implementation > using Python ints of a masked integer sampler. This way you could draw > unsigned integers and then treat this as a bit pool. You would then take > the number of bits needed for your integer, transform these to be a Python > int, and finally apply the mask. > Indeed, that's how `random.Random` does it. I've commented on the issue with an implementation that subclasses `random.Random` to use numpy PRNGs as the source of bits for maximum compatibility with `Random`. The given use case motivating this feature request is networkx, which manually wraps numpy PRNGs in a class that incompletely mimics the `Random` interface. A true subclass eliminates all of the remaining inconsistencies between the two. I'm inclined to leave it at that and not extend the `Generator` interface. https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
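For illustration, a minimal subclass along those lines might look like the following. The class name and the bit-assembly details here are my own sketch, not the implementation posted on the linked issue; `getstate`/`setstate` handling is omitted.

```python
import random

import numpy as np


class NumpyRandom(random.Random):
    """Sketch: a random.Random subclass that draws its bits from a
    NumPy Generator, so Random's methods (randrange on arbitrarily
    large ints, shuffle, choice, ...) work unchanged."""

    def seed(self, a=None, version=2):
        # random.Random.__init__ calls this, so no __init__ is needed.
        self._rng = np.random.default_rng(a)

    def random(self):
        # A float in [0, 1), as Random requires.
        return float(self._rng.random())

    def getrandbits(self, k):
        # Assemble k bits from 32-bit draws; Random.randrange() builds
        # arbitrarily large Python ints on top of this method.
        out = 0
        for word in self._rng.integers(0, 1 << 32,
                                       size=(k + 31) // 32, dtype=np.uint64):
            out = (out << 32) | int(word)
        return out >> ((-k) % 32)
```

Because `getrandbits()` is defined, `random.Random` routes `randrange()` through it, so draws well beyond 64 bits work, and two instances seeded identically produce the same stream.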
[Numpy-discussion] Re: Add to NumPy a function to compute cumulative sums from 0.
On Fri, Aug 11, 2023 at 1:47 PM Benjamin Root wrote: > I'm really confused. Summing from zero should be what cumsum() does now. > > ``` > >>> np.__version__ > '1.22.4' > >>> np.cumsum([[1, 2, 3], [4, 5, 6]]) > array([ 1, 3, 6, 10, 15, 21]) > ``` > which matches your example in the cumsum0() documentation. Did something > change in a recent release? > That's not what's in his example. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
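For readers following the proposal: the difference is that `cumsum0()` prepends the zero, so the result has one more element than the input. A rough sketch of that behavior (the name comes from the proposal; the body here is illustrative):

```python
import numpy as np

def cumsum0(a):
    # Cumulative sum over the flattened input, with 0 prepended,
    # so the result has one more element than the input.
    a = np.asarray(a)
    return np.concatenate([np.zeros(1, dtype=a.dtype), np.cumsum(a)])

cumsum0([[1, 2, 3], [4, 5, 6]])  # array([ 0,  1,  3,  6, 10, 15, 21])
```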
[Numpy-discussion] Re: np.dot yields a different result when computed in two pieces
Accelerated BLAS operations are accelerated precisely by taking these opportunities to rearrange the computations, not just (or primarily) by parallelism. They are very finely tuned kernels that use the fine details of the CPU/FPU to pipeline instructions (which might be SIMD) and optimize memory movement and register use. In particular, SIMD instructions might be used, on your machine, to deal with rows in batches of 4, giving the pattern that you are seeing. I, personally, am not willing to give up those optimizations for a slightly more predictable order of additions. On Tue, Jul 25, 2023 at 8:04 AM Junyan Xu wrote: > Hello everyone, > > I asked this question > https://stackoverflow.com/questions/76707696/np-dot-yields-a-different-result-when-computed-in-two-pieces > on StackOverflow a few days ago but still haven't got an answer. Long story > short, I discovered (in certain common settings like Google Colab) that the > method NumPy uses to compute the dot product of a 4D vector with a 4 x m > matrix depends on m, so that if you cut a 4 x 2m matrix into two 4 x m > matrices and compute the dot products with the vector individually, you get > a slightly different result. More precisely, I found that: > > For m = 1, the order ((0+1)+2)+3 is used; > for m = 2 and 3, the order (0+1)+(2+3) is used for all entries; > for 4 ≤ m ≤ 7, unknown method is used for the first 4 entries, and the > order ((0+1)+2)+3 is used for the rest; > for 8 ≤ m ≤ 11, unknown method is used for the first 8 entries, and the > order ((0+1)+2)+3 is used for the rest; > for 12 ≤ m ≤ 15, unknown method is used for the first 12 entries, and the > order ((0+1)+2)+3 is used for the rest; > and the pattern probably continues. > > If we instead take the dot product of a m x 4 matrix with a 4D vector > (rowMode = True in my code): > > For m = 1, the order ((0+1)+2)+3 is used; > for m ≥ 2, the order (0+2)+(1+3) is used for all entries (verified up to m > = 17). 
> > Moreover, if you replace 4 by an arbitrary n, the difference between the > results grows, super-linearly in n in the vec-dot-mat case, but > sub-linearly in the mat-dot-vec case. > > I think this behavior is highly non-intuitive and probably unnecessary: we > are simply repeating the same operation m times (possibly in parallel), why > should the chosen method depend on how many times the operation is > repeated? Therefore, I'd like to propose an enhancement item to make the > result independent of m. I realize this may need to be done upstream, so as > a first step we may need to find out which BLAS library functions etc. are > called to compute dot products and which are responsible for this behavior. > > Notebook for figuring out order of addition: > https://colab.research.google.com/drive/1O4NtTWMiZbSxKFVg-lfBJDVypIBdVJ22?usp=sharing > Notebook for growth with n: > https://colab.research.google.com/drive/10ecRWvcMEXY0RchVlaLyPUFHJBA4KwA6?usp=sharing > > Thank you in advance for your thoughts on this. > > Best regards, > > Junyan > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
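The underlying point is that float addition is not associative, so any reordering by the kernel can shift results by a few ULPs. A hedged illustration (whether the two computations actually differ depends on the BLAS build in use, so only the size of the difference is bounded here):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(4)
M = rng.standard_normal((4, 16))

full = v @ M
split = np.concatenate([v @ M[:, :8], v @ M[:, 8:]])

# The kernel may sum the four products in a different order for the
# two shapes, so full and split can disagree in the last few bits,
# but never by more than a handful of ULPs.
diff = np.max(np.abs(full - split))
```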
[Numpy-discussion] Re: mixed mode arithmetic
On Tue, Jul 11, 2023 at 10:11 AM Matti Picus wrote: > > On 10/7/23 16:13, Jens Glaser via NumPy-Discussion wrote: > > Hi Matti, > > > > The documentation for numpy.dot currently states > > > > """ > > out > > ndarray, optional > > Output argument. This must have the exact kind that would be returned if > it was not used. In particular, it must have the right type, must be > C-contiguous, and its dtype must be the dtype that would be returned for > dot(a,b). This is a performance feature. Therefore, if these conditions are > not met, an exception is raised, instead of attempting to be flexible. > > """ > > > > I think this means that if dot(a,b) returned FP32 for FP16 inputs, it > would be consistent with this API to supply a full precision output array. > All that would be needed in an actual implementation is a mixed_precision > flag (or output_dtype option) for this op to override the usual type > promotion rules. Do you agree? > > > > Jens > > `np.dot` is strange. Could you use `np.matmul` instead, which is a real > ufunc and (I think) already does this? > Sort of. As currently implemented, no, it won't do what Jens wants because there is no `ee->f` loop implemented (`e` is the typecode for float16, `f` for float32). Passing float16 operands and requesting a float32 output (either with `dtype=np.float32` or providing such an `out=`) will fall down to the `ff->f` loop and cause upcasting of the operands, which is not what they want. But notionally one could add an `ee->f` loop between those two that would catch this case when `dtype=np.float32` is requested. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
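To make those typecodes concrete: requesting a float32 result from float16 operands already works today, but via upcasting into the `ff->f` loop rather than through a dedicated `ee->f` loop. A small check:

```python
import numpy as np

# 'e' and 'f' are the typecodes discussed above.
assert np.dtype(np.float16).char == 'e'
assert np.dtype(np.float32).char == 'f'

a = np.ones((2, 2), dtype=np.float16)
b = np.ones((2, 2), dtype=np.float16)

# No dtype requested: the ee->e loop runs, float16 out.
assert np.matmul(a, b).dtype == np.float16
# dtype=float32 requested: operands are upcast and ff->f runs,
# which is not the accumulate-in-float32 behavior Jens wants.
assert np.matmul(a, b, dtype=np.float32).dtype == np.float32
```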
[Numpy-discussion] Re: mixed mode arithmetic
On Mon, Jul 10, 2023 at 1:49 AM Matti Picus wrote: > On 9/7/23 23:34, glaserj--- via NumPy-Discussion wrote: > > > Reviving this old thread - I note that numpy.dot supports in-place > > computation for performance reasons like this > > c = np.empty_like(a, order='C') > > np.dot(a, b, out=c) > > > > However, the data type of the pre-allocated c array must match the > result datatype of a times b. Now, with some accelerator hardware (i.e. > tensor cores or matrix multiplication engines in GPUs), mixed precision > arithmetic with relaxed floating point precision (i.e., not > necessarily IEEE 754 conformant) but faster performance is possible, > which could be supported in downstream libraries such as cupy. > > > > Case in point, a mixed precision calculation may take half precision > inputs, but accumulate in and return full precision outputs. Due to the > above mentioned type consistency, the outputs would be unnecessarily > demoted (truncated) to half precision again. The current API of numpy does > not expose mixed precision concepts. Therefore, it would be nice if it was > possible to build in support for hardware accelerated linear algebra, even > if that may not be available on the standard (CPU) platforms numpy is > typically compiled for. > > > > I'd be happy to flesh out some API concepts, but would be curious to > first get an opinion from others. It may be necessary to weigh the > complexity of adding such support explicitly against providing minimal > hooks for add-on libraries in the style of JMP (for jax.numpy), or AMP (for > torch). > > > > Jens > > If your goal is "accumulate in and return full precision outputs", then > you can allocate C as the full precision type, and NumPy should do the > right thing. Note it may convert the entire input array to the final > dtype, rather than doing it "on the fly" which could be expensive in > terms of memory.
> In this case, no, `np.dot()` will raise an exception if it's not the dtype that `np.dot(a, b)` would naturally produce. It's different from the ufuncs, which do indeed behave like you describe. `np.dot()` implements its own dtype coercion logic. Jens, there's nothing that really prevents adding mixed precision operations. ufuncs let you provide loops with mixed dtypes, mostly to support special functions that take both integer and real arguments (or real and complex). But one could conceivably put in mixed-precision loops. For non-ufuncs like `np.dot()` that implement their own dtype-coercion logic, it's just a matter of coding it up. The main reasons that we don't are that it's a lot of work for possibly marginal gain to put in all of the relevant permutations and that it complicates an already overly-complicated and not-fully-coherent type promotion scheme. `np.dot()` is kind of an oddball already, and "half-precision inputs -> full-precision outputs" might be a worthwhile use case given hardware accelerators. Given that this largely affects non-numpy implementations of the Array API, you probably want to raise it with that group. numpy can implement that logic if the Array API requires it. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
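The strictness of `np.dot()`'s `out=` handling is easy to observe; a minimal sketch (the exact exception type and message may vary across versions, so the example only checks that it raises):

```python
import numpy as np

a = np.ones((2, 2), dtype=np.float16)
b = np.ones((2, 2), dtype=np.float16)

# dot(a, b) would naturally produce float16, so a float32 out= is
# rejected rather than adapted to; unlike a ufunc, dot() does not
# upcast or downcast to satisfy the output array.
out = np.empty((2, 2), dtype=np.float32)
try:
    np.dot(a, b, out=out)
    raised = False
except (ValueError, TypeError):
    raised = True
```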
[Numpy-discussion] Re: Create a method to index N-dim tensors using 1D index #23992
On Tue, Jun 20, 2023 at 11:38 AM Daniel Salles Civitarese < sall...@br.ibm.com> wrote: > ### Proposed new feature or change: > > I work with geospatial data that requires tensors with many dimensions. > One challenge I used to have was when I had to implement a `__getitem__` to > access those tensors using a 1D index. One use case is machine learning, > when one needs to feed models sequentially with sub-tensors of the original > one. A simple example is a matrix of shape 2x2 that has the positions `(0, > 0)`, `(0, 1)`, `(1, 0)`, `(1, 1)`, and I want to access it via indices `[0, > 1, 2, 3]`. So, when I say `__getitem__(2)`, I want the position `(1, 0)`. > If the only reason you want to compute the indices is to access the data using those flattened indices, you can use the `.flat` property instead and avoid explicitly computing those indices. ``` [~] |21> x = np.arange(2*3).reshape((2, 3)) [~] |22> x array([[0, 1, 2], [3, 4, 5]]) [~] |23> x.flat[[0, 1, 2, 3]] array([0, 1, 2, 3]) ``` -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
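Conversely, if the N-d position for a given flat index is genuinely needed (the `__getitem__(2)` -> `(1, 0)` example), `np.unravel_index` does that conversion without building any index table:

```python
import numpy as np

# Flat index 2 in a 2x2 array corresponds to position (1, 0),
# matching the example in the question.
assert np.unravel_index(2, (2, 2)) == (1, 0)

# And the inverse direction:
assert int(np.ravel_multi_index((1, 0), (2, 2))) == 2
```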
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 5:51 PM Benjamin Root wrote: > I think it is the special values aspect that is most concerning. Math is > just littered with all sorts of identities, especially with trig functions. > While I know that floating point calculations are imprecise, there are > certain properties of these functions that still held, such as going from > -1 to 1. > > As a reference point on an M1 Mac using conda-forge: > ``` > >>> import numpy as np > >>> np.__version__ > '1.24.3' > >>> np.sin(0.0) > 0.0 > >>> np.cos(0.0) > 1.0 > >>> np.sin(np.pi) > 1.2246467991473532e-16 > >>> np.cos(np.pi) > -1.0 > >>> np.sin(2*np.pi) > -2.4492935982947064e-16 > >>> np.cos(2*np.pi) > 1.0 > ``` > > Not perfect, but still right in most places. > FWIW, those ~0 answers are actually closer to the correct answers than 0 would be because `np.pi` is not actually π. Those aren't problems in the implementations of np.sin/np.cos, just the intrinsic limits of floating point representation and the choice of radians, which places particular special values between adjacent representable floating point numbers. > I'm ambivalent about reverting. I know I would love speed improvements > because transformation calculations in GIS is slow using numpy, but also > some coordinate transformations might break because of these changes. > Good to know. Do you have any concrete example that might be worth taking a look at in more detail? Either for performance or accuracy. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
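To spell that out: `np.pi` is the double just below the true π, and sin is locally linear with slope -1 near π, so the correctly rounded `sin(np.pi)` is the small positive gap between `np.pi` and π, not 0. A quick check (only bounds are asserted, since the last bits depend on the libm in use):

```python
import math
import numpy as np

# sin(np.pi) should be a tiny positive number (~1.22e-16); returning
# exactly 0.0 here would actually be *less* accurate, because the
# argument is np.pi, not π.
assert 0.0 < np.sin(np.pi) < 1e-15

# sin of an exact zero is exactly zero in any reasonable libm.
assert np.sin(0.0) == 0.0

# The C library agrees on the sign and scale of sin(pi).
assert 0.0 < math.sin(math.pi) < 1e-15
```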
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 5:01 PM David Menéndez Hurtado < davidmen...@gmail.com> wrote: > On Wed, 31 May 2023, 22:41 Robert Kern, wrote: > >> Not sure anyone really uses tanh for serious work. >> > > At the risk of derailing the discussion, the case I can think of (but kind > of niche) is using neural networks to approximate differential equations. > Then you need non linearities in the gradients everywhere. > Fair. I actually meant to delete that sentence during the last editing pass. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 12:37 PM Devulapalli, Raghuveer < raghuveer.devulapa...@intel.com> wrote: > I wouldn’t discount the performance impact on real world benchmarks > for these functions. Just to name a couple of examples: > > > >- 7x speed up of np.exp and np.log results in a 2x speed up of >training neural networks like logistic regression [1]. I would expect >np.tanh will show similar results for neural networks. >- Vectorizing even simple functions like np.maximum results in a 1.3x >speed up of sklearn’s Kmeans algorithm [2] > > > > Raghuveer > > > > [1] https://github.com/numpy/numpy/pull/13134 > > [2] https://github.com/numpy/numpy/pull/14867 > Perfect, those are precisely the concrete use cases I would want to see so we can talk about the actual ramifications of the changes. These particular examples suggest to me that a module or package providing fast-inaccurate functions would be a good idea, but not across-the-board fast-inaccurate implementations (though it's worth noting that the exp/log/maximum replacements that you cite don't seem to be particularly inaccurate). The performance improvements show up in situational use cases. Logistic regression is not really a neural network (unless you squint real hard), so the loss function does account for a significant fraction of whole-program runtime; the activation and loss functions of real neural networks take up a rather small amount of time compared to the matmuls. Nonetheless, people do optimize activation functions, but often by avoiding special functions entirely with ReLUs (which have other benefits in terms of nice gradients). Not sure anyone really uses tanh for serious work. ML is a perfect use case for *opt-in* fast-inaccurate implementations. The whole endeavor is to replace complicated computing logic with a smaller number of primitives that you can optimize the hell out of, and let the model size and training data size handle the complications.
And a few careful choices by people implementing the marquee packages can have a large effect. In the case of transcendental activation functions in NNs, if you really want to optimize them, it's a good idea to trade *a lot* of accuracy (measured in %, not ULPs) for performance, in addition to doing it on GPUs. And that makes changes to the `np.*` implementations mostly irrelevant for them, and you can get that performance without making anyone else pay for it. Does anyone have compelling concrete use cases for accelerated trig functions, per se, rather than exp/log and friends? I'm more on board with accelerating those than trig functions because of their role in ML and statistics (I'd still *prefer* to opt in, though). They don't have many special values (which usually have alternates like expm1 and log1p to get better precision in any case). But for trig functions, I'm much more likely to be doing geometry where I'm with Archimedes: do not disturb my circles! -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
On Wed, May 31, 2023 at 10:40 AM Ralf Gommers wrote: > > > On Wed, May 31, 2023 at 4:19 PM Charles R Harris < > charlesr.har...@gmail.com> wrote: > >> >> >> On Wed, May 31, 2023 at 8:05 AM Robert Kern >> wrote: >> >>> I would much, much rather have the special functions in the `np.*` >>> namespace be more accurate than fast on all platforms. These would not >>> have been on my list for general purpose speed optimization. How much time >>> is actually spent inside sin/cos even in a trig-heavy numpy program? And >>> most numpy programs aren't trig-heavy, but the precision cost would be paid >>> and noticeable even for those programs. I would want fast-and-inaccurate >>> functions to be strictly opt-in for those times that they really paid off. >>> Probably by providing them in their own module or package rather than a >>> runtime switch, because it's probably only a *part* of my program that >>> needs that kind of speed and can afford that precision loss while there >>> will be other parts that need the precision. >>> >>> >> I think that would be a good policy going forward. >> > > There's a little more to it than "precise and slow good", "fast == less > accurate == bad". We've touched on this when SVML got merged (e.g., [1]) > and with other SIMD code, e.g. in the "Floating point precision > expectations in NumPy" thread [2]. Even libm doesn't guarantee the best > possible result of <0.5 ULP max error, and there are also considerations > like whether any numerical errors are normally distributed around the exact > mathematical answer or not (see, e.g., [3]). > If we had a portable, low-maintenance, high-accuracy library that we could vendorize (and its performance cost wasn't *horrid*), I might even advocate that we do that. Reliance on platform libms is mostly about our maintenance burden than a principled accuracy/performance tradeoff. 
My preference *is* definitely firmly on the "precise and slow good" for these ufuncs because of the role these ufuncs play in real numpy programs; performance has a limited, situational effect while accuracy can have substantial ones across the board. It seems fairly clear that with this recent change, the feeling is that the > tradeoff is bad and that too much accuracy was lost, for not enough > real-world gain. However, we now had several years worth of performance > work with few complaints about accuracy issues. > Except that we get a flurry of complaints now that they actually affect popular platforms. I'm not sure I'd read much into a lack of complaints before that. > So I wouldn't throw out the baby with the bath water now and say that we > always want the best accuracy only. It seems to me like we need a better > methodology for evaluating changes. Contributors have been pretty careful, > but looking back at SIMD PRs, there were usually detailed benchmarks but > not always detailed accuracy impact evaluations. > I've only seen micro-benchmarks testing the runtime of individual functions, but maybe I haven't paid close enough attention. Have there been any benchmarks on real(ish) *programs* that demonstrate what utility these provide in even optimistic scenarios? I care precisely <1ULP about the absolute performance of `np.sin()` on its own. There are definitely programs that would care about that; I'm not sure any of them are (or should be) written in Python, though. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Precision changes to sin/cos in the next release?
I would much, much rather have the special functions in the `np.*` namespace be more accurate than fast on all platforms. These would not have been on my list for general purpose speed optimization. How much time is actually spent inside sin/cos even in a trig-heavy numpy program? And most numpy programs aren't trig-heavy, but the precision cost would be paid and noticeable even for those programs. I would want fast-and-inaccurate functions to be strictly opt-in for those times that they really paid off. Probably by providing them in their own module or package rather than a runtime switch, because it's probably only a *part* of my program that needs that kind of speed and can afford that precision loss while there will be other parts that need the precision. On Wed, May 31, 2023 at 1:59 AM Sebastian Berg wrote: > Hi all, > > there was recently a PR to NumPy to improve the performance of sin/cos > on most platforms (on my laptop it seems to be about 5x on simple > inputs). > This changes the error bounds on platforms that were not previously > accelerated (most users): > > https://github.com/numpy/numpy/pull/23399 > > The new error is <4 ULP, similar to what it was before, but previously > that error applied only on high-end Intel CPUs, which most users would not have noticed. > And unfortunately, it is a bit unclear whether this is too disruptive > or not. > > The main surprise is probably that the range of both does not include 1 > (and -1) exactly with this, and quite a lot of downstream packages > noticed this and needed test adaptations. > > Now, most of these are harmless: users shouldn't expect exact results > from floating point math and test tolerances need adjustment. OTOH, > sin/cos are practically 1/-1 on a wide range of inputs (they are > basically constant) so it is surprising that they deviate from it and > never reach 1/-1 exactly. > > Quite a few downstream libs noticed this, and NumPy users cannot > explicitly opt in to a different performance/precision trade-off.
The > question is whether it would be better to revert for now > and hope for a better implementation later. > > I doubt we can decide on a very clear-cut yes/no, but I am very > interested in whether everyone thinks this precision trade-off is too > surprising to users. > > Cheers, > > Sebastian > > > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Proposal: "intrange" for inclusive integer intervals
> [...] inclusive floating point ranges by scaling: > > intrange(10, 15, bound="closed") / 10 > [1.0, 1.1, 1.2, 1.3, 1.4, 1.5] > > This way, we don't have to worry about floating point imprecision screwing > up the bounds calculation. > > Sketch implementation: > > def intrange(start, stop, step=1, bound="open", anchor="start"): > match (anchor, bound): > case ("start", "open"): > return np.arange(start, stop, step) > case ("start", "closed"): > return np.arange(start, stop + step, step) > case ("start", "exact"): > result = np.arange(start, stop + step, step) > result[-1] = stop > return result > case ("stop", "open"): > return np.flip(np.arange(stop, start, -step)) > case ("stop", "closed"): > return np.flip(np.arange(stop, start - step, -step)) > case ("stop", "exact"): > result = np.flip(np.arange(stop, start - step, -step)) > result[0] = start > return result > case _: > assert False > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
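The scaling example from the quoted proposal can be checked by pulling out just the ("start", "closed") branch (a minimal restatement of the sketch above, not an existing numpy API):

```python
import numpy as np

def intrange_closed(start, stop, step=1):
    # The ("start", "closed") branch of the quoted sketch: an integer
    # range that includes the stop value.
    return np.arange(start, stop + step, step)

# Scaling integer endpoints keeps float imprecision out of the bounds
# calculation itself:
assert (intrange_closed(10, 15) / 10).tolist() == [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
```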
[Numpy-discussion] Re: Managing integer overflow
On Tue, May 2, 2023 at 8:03 AM wrote: > > I think this example shows that you don't need any special infrastructure > > from numpy. I don't think there is going to be much appetite to expand > our > > API in this direction. > > But I do! I'm looking for something that implements the > multiply_check_ov(X,Y) and similar functionality for addition and > subtraction, for large arrays; I don't know how to do that efficiently. (Or > even correctly for 64-bit operands --- for 8/16/32 I suppose as a baseline > I could do the math in 64-bit arithmetic and if the answer doesn't match, > then use the sign bit of the 64-bit result for determining overflow.) > Some C compilers provide some builtin functions that do this for you, and some cross-platform headers around that show how to do this. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html https://github.com/pytorch/pytorch/blob/main/c10/util/safe_numerics.h https://github.com/dcleblanc/SafeInt -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
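For the pure-NumPy baseline described in the question (do the math in 64-bit and compare), a sketch for dtypes narrower than 64 bits might look like this; `multiply_check_ov` is the poster's name for it, and the body is illustrative rather than a library function:

```python
import numpy as np

def multiply_check_ov(x, y):
    # Multiply in int64, then flag entries whose exact product falls
    # outside the original dtype's range: +1 for positive overflow,
    # -1 for negative overflow, 0 otherwise. Only valid for integer
    # dtypes narrower than 64 bits.
    x = np.asarray(x)
    y = np.asarray(y)
    wide = x.astype(np.int64) * y.astype(np.int64)
    info = np.iinfo(x.dtype)
    ov = np.zeros(wide.shape, dtype=np.int8)
    ov[wide > info.max] = 1
    ov[wide < info.min] = -1
    return wide.astype(x.dtype), ov  # narrowed result wraps on overflow
```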
[Numpy-discussion] Re: Managing integer overflow
Two boolean arrays are not more compact than one signed `int8` array. But since you'd have to make bool masks from such an array anyways to do anything useful, you might as well pass the bool masks. I think this example shows that you don't need any special infrastructure from numpy. I don't think there is going to be much appetite to expand our API in this direction. On Thu, Apr 27, 2023 at 9:13 AM wrote: > > Ideally if there is an overflow in an array operation I'd like to > produce an "overflow array" that's the same size as the result array, but > with the values +1 if it's a positive overflow or -1 if it's a negative > overflow. > > Alternatively, if boolean arrays are a much more compact memory footprint, > then return two overflow arrays, one positive and one negative. > > Use case: > > X = ... something ... > Y = ... something else ... > Z, ovpos, ovneg = multiply_check_ov(X,Y) > if ovpos.any(): >np.iinfo(Z.dtype).max > if ovneg.any(): ># uh oh, something we didn't expect, so raise an error > ___ > NumPy-Discussion mailing list -- numpy-discussion@python.org > To unsubscribe send an email to numpy-discussion-le...@python.org > https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ > Member address: robert.k...@gmail.com > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Giving deprecation of e.g. `float(np.array([1]))` a shot (not 0-d)
On Thu, Apr 20, 2023 at 12:39 PM Evgeni Burovski wrote: > If symmetry w.r.t. pytorch is any guide, it was nice to have it: > > In [38]: float(torch.as_tensor([2])) > Out[38]: 2.0 > > In [39]: float(np.asarray([2])) > Out[39]: 2.0 > My question would be: Did they have a positive use case for this behavior, or were they just reflecting NumPy's behavior? AFAICR, the main reasoning on our side was that there was an unambiguous value that we _could_ return, so we might as well. And in our later experience, it was more trouble than it was worth. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: broadcasting question
On Wed, Mar 22, 2023 at 9:34 AM Neal Becker wrote: > I have a function F > def F(a, b): > c = a * b > > Initially, a is a scalar, b[240,3000]. No problem. > Later I want to use F, where a[240] is a vector. I want to allow both the > scalar and vector cases. So I write: > > def F(a,b): > a = np.atleast_1d(a) > c = a[:,None] * b > > This now works for scalar a or vector a. But this solution seems > inelegant, and somewhat fragile. Suppose later we want to allow > a[240,3000], a 2d array matching b. > > Certainly don't want to write code like: > if a.ndim == 0:... > > Is there a more elegant/robust approach? > I would leave it as `c = a * b` and simply record in the docstring that `a` and `b` should be broadcastable. Yes, that means that the user will have to write `F(a[:, np.newaxis], b)` for that one case, and that looks a little ugly, but overall it's less cognitive load on the user to just reuse the common convention of broadcasting than to record the special case. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
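Concretely (with shapes shrunk from the 240x3000 in the question), the plain `a * b` version covers all three cases once the caller supplies the axis:

```python
import numpy as np

def F(a, b):
    # a and b only need to be broadcastable against each other.
    return a * b

b = np.ones((4, 5))

assert F(2.0, b).shape == (4, 5)                    # scalar a
a = np.arange(4.0)
assert F(a[:, np.newaxis], b).shape == (4, 5)       # per-row vector a
assert F(np.ones((4, 5)), b).shape == (4, 5)        # full 2-d a
```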
[Numpy-discussion] Re: Dear Robert
On Sat, Feb 25, 2023 at 8:17 PM Louis Petingi wrote: > Hi Robert > > Just a follow up. I was able (my student) to get the 1 vector from the 0 > eigenvector. Even though the normalized or this set of eigenvectors will > work we could try the two sets. Not sure if multiplying the unit vectors by > a scalar is the correct approach. > Yeah, that essentially amounts to multiplying all of the eigenvectors by `np.sqrt(n)`. I don't imagine it will do much to change the results of K-means or another clustering algorithm unless it has absolute distance thresholds embedded in it. But it's absolutely a free choice for you to precondition things for your clustering algorithm. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
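That freedom is easy to verify: the columns returned by `np.linalg.eig` are unit-normalized, and rescaling them by `np.sqrt(n)` (or anything else) still satisfies the eigenvalue equation. A small example on a 2x2 symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
w, v = np.linalg.eig(A)

n = A.shape[0]
scaled = v * np.sqrt(n)  # rescale every eigenvector by sqrt(n)

# A @ x = lambda * x still holds column by column after rescaling:
assert np.allclose(A @ scaled, scaled * w)
# ...and the columns LAPACK returned were unit-norm to begin with.
assert np.allclose(np.linalg.norm(v, axis=0), 1.0)
```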
[Numpy-discussion] Re: Dear Robert
On Sat, Feb 25, 2023 at 5:33 PM Louis Petingi wrote:
> As you mentioned, this is a generalization of the Fiedler eigenvector. When applying spectral clustering and you want to find the two clusters, then the Fiedler eigenvector tells you how to partition the vertices (bipartition) so the normalized cut is minimized. The concept can be generalized to k clusters by applying K-means to the first k eigenvectors. The two partitions can be determined by grouping the vertices corresponding to the negative values of the eigenvector into one cluster, and the other cluster is the vertices corresponding to the non-negative values.
>
> I can perhaps use the normalized eigenvectors, but I am not sure if this is correct; thus, I prefer using the magnitudes. In theory it may work, as for example the entries of the Fiedler eigenvector are divided by the norm.

My understanding of this family of techniques is that the K-means step is basically heuristic (see section 8.4 of [1]). You have a lot of choices here, and playing around with the magnitudes might be among them. But there is nothing that comes _from_ the eigenvector calculation that can help determine what this should be. There's nothing special about the way that `np.linalg.eig()` computes eigenvectors before normalization that will improve the K-means step. Rather, you have the freedom to scale the eigenvectors freely to help out the K-means calculation (or any other clustering that you might want to do there). I suspect that the unit-normalization is probably one of the better ones.

[1] http://people.csail.mit.edu/dsontag/courses/ml14/notes/Luxburg07_tutorial_spectral_clustering.pdf

> The eigenvalue and eigenvector calculators (on the internet) find the magnitude of the vector.

No, they do not find "the magnitude" because there is no such thing. They just give _an_ arbitrary magnitude that happened to be the raw output of their _particular_ algorithm without the following unit-normalization step.
There are lots of algorithms to compute eigenvectors. The magnitudes of the eigenvector solutions before normalization are not all the same for all of the different algorithms. There is absolutely no guarantee (and very little probability) that if we modified `np.linalg.eig()` to not apply normalization that the unnormalized eigenvectors would match the unnormalized eigenvectors of any other implementation.

-- Robert Kern
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 4:11 PM Louis Petingi wrote:
> Hi Thanks
>
> Very simply, one of the solutions for the zero eigenvalue is the 1 eigenvector. If I get back this 1 vector for the 0 eigenvalue, then the other eigenvectors will be in the right format I am looking for. Once again, the 1 vector is the normalized eigenvector * norm.

There are multiple rules that could get this result. You could multiply all of the eigenvectors by `sqrt(n)`. Alternately, you could multiply each eigenvector `v` by `1/v[np.nonzero(v)[0][0]]` (i.e. the first nonzero value in the eigenvector). Both would give you the desired result for the 0-eigenvalue eigenvector, but different results for every other eigenvector. Both are simple and coherent rules, but quite likely neither are what you are looking for. Remember, each eigenvector can be scaled independently of any other, so establishing the desired result for one eigenvector does nothing to constrain any other. If you want help finding the rule that will help you in your research, you'll have to let us know what you are going to use the magnitudes _for_. The information you've given so far (where the eigenvectors come _from_) doesn't actually help narrow anything down. The only application of eigenvector magnitudes of graph Laplacians that I am aware of is the Fiedler vector, and that actually requires unit eigenvectors.

-- Robert Kern
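For concreteness, the two candidate rescaling rules can be sketched like this (a hypothetical illustration on a small path-graph Laplacian, not code from the thread; `np.linalg.eigh()` is used because the Laplacian is symmetric):

```python
import numpy as np

# Laplacian (degree matrix minus adjacency matrix) of a path graph on 4 vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Columns of v are unit-norm eigenvectors, in ascending order of eigenvalue,
# so column 0 belongs to the 0 eigenvalue.
w, v = np.linalg.eigh(L)
n = L.shape[0]

# Rule 1: multiply every eigenvector by sqrt(n).
v_sqrt_n = v * np.sqrt(n)

# Rule 2: divide each eigenvector by its first nonzero entry
# (fragile in floating point; shown only to illustrate the point).
first_nonzero = np.array([col[np.nonzero(col)[0][0]] for col in v.T])
v_first = v / first_nonzero
```

Both rules turn the 0-eigenvalue eigenvector into the all-ones vector, yet they disagree on every other eigenvector, which is the point of the reply.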
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 2:11 PM Louis Petingi wrote:
> Thank you for the reply. I am working with the Laplacian matrix of a graph, which is the Degree matrix minus the adjacency matrix. The Laplacian is a symmetric matrix and the smallest eigenvalue is zero. As the rows add up to 0, Lx=0x, and 1 is the resulting vector. The normalized eigenvector is the 1 vector divided by the norm. So if I have 10 vertices of the graph, the normalized eigenvector is 1/sqrt(10). I do understand that a scale * normalized eigenvector is also a solution, but for the purpose of my research I need the normalized eigenvector * norm.

I apologize, but I'm going to harp on this: you are using the word "the" as if there is one unique magnitude that we could report. There is none. If you have a use for a specific convention for the magnitudes, you'll have to point us to the literature that talks about it, and we might be able to give you pointers as to how to get the magnitudes that your research requires.

> For the 0 eigenvalue the norm of the eigenvector is easy to figure out but not for the other eigenvalues.
>
> That is what I meant by the original eigenvector, and sorry for the confusion. Most eigenvalue/eigenvector calculators will give you 1 for the first eigenvector.

I'm afraid that doesn't really narrow anything down for us.

-- Robert Kern
[Numpy-discussion] Re: non normalised eigenvectors
On Sat, Feb 25, 2023 at 11:39 AM wrote:
> Dear all,
>
> I am not an expert in NumPy, but my undergraduate student is having some issues with the way Numpy returns the normalized eigenvectors corresponding to the eigenvalues. We do understand that an eigenvector is divided by the norm to get the unit eigenvectors; however, we do need the original vectors for the purpose of my research. This has been a really frustrating experience as NumPy returns the normalized vectors by default. I appreciate any suggestions of how to go about this issue. This seems to be an outstanding issue for people using Numpy.

I'm not sure what you mean by "the original vectors". All multiples of the unit eigenvector are eigenvectors. None have a claim on being "the original vector". Do you have a reference for what you are referring to? It's possible that there are specific procedures that happen to spit out vectors that are eigenvectors but have semantics about the magnitude, but `np.linalg.eig()` does not implement that procedure. The semantics about the magnitude would be supplied by that specific procedure.

-- Robert Kern
[Numpy-discussion] Re: Advanced indexing doesn't follow the Principle of least astonishment
On Thu, Dec 29, 2022 at 8:50 AM Diogo Valada wrote:
> Hi all,
>
> New to the mailing list, so I hope I'm creating a discussion in the right place.
>
> Am I the only one that thinks that advanced indexing in numpy doesn't follow the principle of minimum astonishment?
>
> for example
>
> ```python
> a = np.random.rand(100, 100)
>
> a[(2,4)]     # this yields the element at [2,4]
> a[[2,4]]     # this yields the rows at position 2 and 4
> a[1, (2,4)]  # this yields the 2nd and 4th elements of row 1. (So actually does advanced indexing)
> a[1, [2,4]]  # Works the same way as the previous one.
> ```
>
> Worst of all, it's very easy for someone to make a mistake and not notice it: it seems to me that the first method, a[(2,4)], should not be allowed, and instead only a[*(2,4)] should work. I checked how it works in Julia (which has a similar syntax), and a[(2,4)] would yield an error, which makes sense to me. Could it be an idea to deprecate a[(2,4)]-like usages?

No, that's not possible. In Python syntax, the comma `,` is what creates the tuple, not the parentheses. So `a[(2,4)]` is exactly `a[2, 4]`. `a[2, 4]` translates to `a.__getitem__((2, 4))`. So there's no way for the array to know whether it got `a[2, 4]` or `a[(2, 4)]`.

-- Robert Kern
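A quick demonstration of the equivalence (illustrative, with a small array rather than the poster's 100x100 one):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# The comma creates the tuple; the parentheses are redundant.
assert a[2, 3] == a[(2, 3)]

# Both spellings make the exact same call under the hood:
assert a[2, 3] == a.__getitem__((2, 3))

# A *list* (or array) of indices is what triggers advanced indexing:
rows = a[[0, 2]]   # rows 0 and 2, shape (2, 4)
```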
[Numpy-discussion] Re: How will Google DeepMind's AlphaTensor effect numpy
On Wed, Oct 26, 2022 at 9:31 AM wrote:
> Hello!
>
> I was curious how AlphaTensor will affect NumPy and other similar applications, considering it has found a way to perform 3x3 matrix multiplication efficiently. https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor. I am not even sure how NumPy does this under the hood, is it 2x2?
>
> Is anyone working on implementing this 3x3 algorithm for NumPy? Is it too early, and if so why? Are there any concerns about this algorithm?

numpy links against accelerated linear algebra libraries like OpenBLAS and Intel MKL to provide the matrix multiplication. If they find that the AlphaTensor results are better than the options they currently have, then numpy will get them. In general, I think it is unlikely that they will be used. Even the older state of the art that they compare with, like Strassen's algorithm, is not often used in practice. Concerns like memory movement and the ability to use instruction-level parallelism on each kind of CPU tend to dominate over a marginal change in the number of multiplication operations. The answers to this StackOverflow question give some more information: https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance

-- Robert Kern
[Numpy-discussion] Re: An extension of the .npy file format
On Thu, Aug 25, 2022 at 3:47 PM Qianqian Fang wrote: > On 8/25/22 12:25, Robert Kern wrote: > > I don't quite know what this means. My installed version of `jq`, for > example, doesn't seem to know what to do with these files. > > ❯ jq --version > jq-1.6 > > ❯ jq . eye5chunk_bjd_raw.jdb > parse error: Invalid numeric literal at line 1, column 38 > > > the .jdb files are binary JSON files (specifically BJData) that jq does > not currently support; to save as text-based JSON, you change the suffix to > .json or .jdt - it results in ~33% increase compared to the binary due to > base64 > > Okay. Given your wording, it looked like you were claiming that the binary JSON was supported by the whole ecosystem. Rather, it seems like you can either get binary encoding OR the ecosystem support, but not both at the same time. > I think a fundamental problem here is that it looks like each element in > the array is delimited. I.e. a `float64` value starts with b'D' then the 8 > IEEE-754 bytes representing the number. When we're talking about > memory-mappability, we are talking about having the on-disk representation > being exactly what it looks like in-memory, all of the IEEE-754 floats > contiguous with each other, so we can use the `np.memmap` `ndarray` > subclass to represent the on-disk data as a first-class array object. This > spec lets us mmap the binary JSON file and manipulate its contents in-place > efficiently, but that's not what is being asked for here. > > there are several BJData-compliant forms to store the same binary array > losslessly. The most memory efficient and disk-mmapable (but not > necessarily disk-efficient) form is to use the ND-array container syntax > <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type> > that BJData spec extended over UBJSON. > > Are any of them supported by a Python BJData implementation? 
I didn't see any option to get that done in the `bjdata` package you recommended, for example. https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200

-- Robert Kern
[Numpy-discussion] Re: An extension of the .npy file format
ment in the array is delimited. I.e. a `float64` value starts with b'D' then the 8 IEEE-754 bytes representing the number. When we're talking about memory-mappability, we are talking about having the on-disk representation being exactly what it looks like in-memory, all of the IEEE-754 floats contiguous with each other, so we can use the `np.memmap` `ndarray` subclass to represent the on-disk data as a first-class array object. This spec lets us mmap the binary JSON file and manipulate its contents in-place efficiently, but that's not what is being asked for here.

> 2. UBJSON/BJData natively support append-able root-level records; JSON has been extensively used in data streaming with appendable nd-json or concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
>
> just a quick comparison of output file sizes with a 1000x1000 unitary diagonal matrix
>
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000)                 # create a large array
> y = np.vsplit(x, 5)              # split into smaller chunks
> np.save('eye5chunk.npy', y)      # save npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'})   # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'})   # lzma-compressed bjd
> newy = jd.load('eye5chunk_bjd_zlib.jdb')   # loading/decoding
> newx = np.concatenate(newy)                # regroup chunks
> newx.dtype
>
> here are the output file sizes in bytes:
>
> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb

Just a note: This difference is solely due to a special representation of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as a special value and uses the `float32` encoding of it). If you had any other value making up the bulk of the file, this would be larger than the NPY due to the additional delimiter b'D'.
> 10338  eye5chunk_bjd_zlib.jdb
>  2206  eye5chunk_bjd_lzma.jdb
>
> Qianqian

-- Robert Kern
[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced
On Thu, Aug 25, 2022 at 4:27 AM Bill Ross wrote:
> Thanks, np.lib.format.open_memmap() works great! With prediction procs using minimal sys memory, I can get twice as many on GPU, with fewer optimization warnings.
>
> Why even have the number of records in the header? Shouldn't record size plus system-reported/growable file size be enough?

Only in the happy case where there is no corruption. Implicitness is not a virtue in the use cases that the format was designed for. There is an additional use case where the length is unknown a priori where implicitness would help, but the format was not designed for that case (and I'm not sure I want to add that use case).

> I'd love to have a shared-mem analog for smaller-scale data; now I load data and fork to emulate that effect.

There are a number of ways to do that, including using memmap on files on a memory-backed filesystem like /dev/shm/ on Linux. See this article for several more options: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2

> My file sizes will exceed memory, so I'm hoping to get the most out of memmap. Will this in-loop assignment to predsum work to avoid loading all to memory?
>
> predsum = np.lib.format.open_memmap(outfile, mode='w+',
>                                     shape=(ids_sq,), dtype=np.float32)
>
> for i in range(len(IN_FILES)):
>     pred = numpy.lib.format.open_memmap(IN_FILES[i])
>     predsum = np.add(predsum, pred)  # <-

This will replace the `predsum` array with a new in-memory array the first time through this loop. Use `out=predsum` to make sure that the output goes into the memory-mapped array:

    np.add(predsum, pred, out=predsum)

Or the usual augmented assignment:

    predsum += pred

> del pred
> del predsum

The precise memory behavior will depend on your OS's virtual memory configuration.
But in general, `np.add()` will go through the arrays in order, causing the virtual memory system to page in memory pages as they are accessed for reading or writing, and page out the old ones to make room for the new pages. Linux, in my experience, isn't always the best at managing that backlog of old pages, especially if you have multiple processes doing similar kinds of things (in the past, I have seen *each* of those processes trying to use *all* of the main memory for their backlog of old pages), but there are configuration tweaks that you can make.

-- Robert Kern
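Putting the `out=` fix together, here is a self-contained sketch with small temporary files (the file names and sizes are made up for illustration; the real files would be too large to hold in memory, which is the point of the memmap):

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()
in_files = [os.path.join(tmp, "pred%d.npy" % i) for i in range(3)]

# Three small input arrays standing in for the real prediction files.
for i, fn in enumerate(in_files):
    np.save(fn, np.full(5, float(i), dtype=np.float32))

outfile = os.path.join(tmp, "predsum.npy")
predsum = np.lib.format.open_memmap(outfile, mode="w+",
                                    shape=(5,), dtype=np.float32)
predsum[:] = 0          # be explicit; don't rely on zero-filled pages

for fn in in_files:
    pred = np.lib.format.open_memmap(fn, mode="r")
    np.add(predsum, pred, out=predsum)   # accumulate into the mapped array
    del pred
del predsum

result = np.load(outfile)
```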
[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced
On Tue, Aug 23, 2022 at 8:47 PM wrote:
> I want to calc multiple ndarrays at once and lack memory, so want to write in chunks (here sized to GPU batch capacity). It seems there should be an interface to write the header, then write a number of elements cyclically, then add any closing rubric and close the file.
>
> Is it as simple as lib.format.write_array_header_2_0(fp, d) then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?

`item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There is no footer after the data. The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', dtype=dtype, shape=shape)`, then assign slices sequentially to the returned memory-mapped array. A memory-mapped array is usually going to be friendlier to whatever memory limits you are running into than a nominally "in-memory" array.

-- Robert Kern
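A hedged sketch of that basic scheme (the dtype, shape, and chunk size here are arbitrary choices for illustration):

```python
import os
import tempfile
import numpy as np

n_total, chunk = 10, 5
fn = os.path.join(tempfile.mkdtemp(), "serial.npy")

with open(fn, "wb") as fp:
    # Write the header once, with the final shape known up front.
    header = {"descr": "<f4", "fortran_order": False, "shape": (n_total,)}
    np.lib.format.write_array_header_2_0(fp, header)
    # Then write the data cyclically; there is no footer after the data.
    for start in range(0, n_total, chunk):
        item = np.arange(start, start + chunk, dtype="<f4")
        item.tofile(fp)

roundtrip = np.load(fn)
```

The result is an ordinary `.npy` file readable with `np.load()`.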
[Numpy-discussion] Re: generate a random sample based on independent priors in `random.choice` #22082
As discussed on the corresponding Github issue[1], this is not going to be functionality that we add onto `choice()`, nor is it likely something that we will add as a separate `Generator` method. The desired computation is straightforward to do by composing existing functionality.

[1] https://github.com/numpy/numpy/issues/22082

On Thu, Aug 4, 2022 at 2:07 PM Rodo-Singh wrote:
> Proposed new feature or change:
>
> Objective: Sample elements from the given iterator (a list, a numpy array, etc.) based upon pre-defined probabilities associated with each element, which may not sum up to 1.
>
> Overview
> * In the numpy.random.choice function, in the cases where we're explicitly providing the list of probabilistic values for an input sample, the requirement expects the sum of the whole list to be 1. This makes sense when all the elements are possible observations for a single random variable whose pmf (probability mass function) is nothing but the p list.
> * But when every element in that list can be regarded as an observation of a separate independent Bernoulli r.v. (random variable), then the problem falls into the category of multi-label. Intuitively speaking, for each element in the list or array given to us, we'll toss a coin (may be biased or unbiased) ~B(p) (i.e., follows Bernoulli with p as success probability for head).
> * The output array or list would probably be a proper subset, or can be the full list, or can be an empty one.
> * Plus, here the argument size should automatically get shut down as we just want to know which elements got selected into the output list after n coin tosses (where len(list) = n). Also it may happen that each element is independent as argued above, but sum(p) = 1. Then we can probably put an extra argument independence = True or independence = False.
-- Robert Kern
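For reference, one way to compose that computation from existing functionality might look like this (an illustrative sketch, not an official recipe):

```python
import numpy as np

rng = np.random.default_rng(42)

items = np.array(["a", "b", "c", "d"])
p = np.array([0.9, 0.1, 0.5, 1.0])   # independent per-element probabilities

# One Bernoulli trial per element; keep the successes.
mask = rng.random(len(items)) < p
selected = items[mask]
```

Each element is kept or dropped independently, so the result can be anything from the empty array up to the full list, as the proposal describes.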
[Numpy-discussion] Re: seed for new random number generator
On Tue, Jun 21, 2022 at 9:54 AM Pieter Eendebak wrote:
> Hi Robert,
>
> Thanks for the information and the offer. My use case is development of algorithms where I want to be able to test variations of the algorithm while keeping the same (simulated or generated) data. Related to that are example scripts part of the documentation that I want to always have the same result.
>
> Creating a generator and passing that (someway or the other) is a good approach, but it requires some refactoring of the code. Especially for third-party code I would like to avoid this (to be honest, most external code is still using the old numpy random number interface, so there we can use the old style of setting the seed)

Ah, okay, so you're replacing the old global with a new global Generator instance in your own code as an easier transitional step.

    # global instance
    rng = np.random.default_rng()

    def seed_global_rng(seed=None):
        # Create the same type of BitGenerator with the given seed,
        # and then copy its state over to the global.
        bg_cls = type(rng.bit_generator)
        bg = bg_cls(seed)
        rng.bit_generator.state = bg.state

I do recommend moving to passing around a Generator when you can; even in the older system, we've always recommended passing around a RandomState instance.

-- Robert Kern
[Numpy-discussion] Re: seed for new random number generator
On Sun, Jun 19, 2022 at 9:37 AM Pieter Eendebak wrote:
> Hi everyone,
>
> The new numpy random interface (e.g. r=numpy.random.default_rng; r.random) is much faster than the old one (e.g. np.random.random). When converting code from the old style to the new style I miss having a way to set the seed of the RNG.
>
> I tried:
>
> rng.bit_generator = np.random.PCG64(seed=42)     # fails, with good error message
> rng.bit_generator.state['state']['state'] = 42   # has no effect, perhaps make this dict read-only?
>
> Is there a way to set the seed without creating a new RNG object?

We generally recommend just creating a new Generator and passing that around in almost all cases. Whenever that can possibly be made to work, please do that. The use of np.random.seed() is usually a code smell indicating that someone was working around the fact that there was just the one global underneath np.random.random() et al. When you don't have the constraint of a single global instance, it's almost always going to be better to use multiple instances; you don't need that workaround anymore.

There are ways to punch in a new BitGenerator into an existing Generator, if you must, and also ways to punch in a state dict into the BitGenerator, but these are largely for various meta-programming tasks like serialization rather than day-to-day use of pseudorandom numbers. If you are converting old code that happened to use np.random.seed(), it is almost certainly not one of these tasks. If you want to describe your use case more specifically, I can give you some more guidance on good patterns to replace it with the new system.

-- Robert Kern
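A minimal sketch of the recommended pattern of passing a Generator around (function and variable names are illustrative):

```python
import numpy as np

def simulate(n, rng):
    # Take the Generator as an explicit argument rather than
    # seeding a hidden global.
    return rng.normal(size=n)

rng = np.random.default_rng(12345)       # seed once, at the top level
a = simulate(3, rng)

# Re-running with a fresh Generator and the same seed reproduces the data.
b = simulate(3, np.random.default_rng(12345))
```

Reproducibility comes from constructing the Generator with a known seed, not from mutating a shared global state.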
[Numpy-discussion] Re: ndarray shape permutation
On Tue, May 17, 2022 at 10:36 PM Joseph Fox-Rabinovitz <jfoxrabinov...@gmail.com> wrote:
> You could easily write an extension to ndarray that maps axis names to indices and vice versa.

Indeed. There have been several over the years. The current most-maintained one is xarray, near as I can tell. https://xarray.pydata.org/en/stable/index.html

-- Robert Kern
[Numpy-discussion] Re: ENH: Support Random Unit Vector
On Tue, May 17, 2022 at 5:20 AM wrote:
> ### Proposed new feature or change:
>
> Description
>
> Often you want to randomize a "direction" that doesn't have "size". This can be useful when:
> * You want to randomize the direction of an NPC walk developing in a 2D or 3D game, where its speed is a predefined constant.
> * You're developing a particle simulation with all particles moving in random directions
> * etc.
>
> This random direction should be an `n`th dimensional unit vector which is randomized **uniformly** from the unit sphere. That means that sections of the unit sphere with equal areas should have the same chance of getting a vector from.
>
> Implementation
> ===
>
> If one wants to randomize an `n`th dimensional direction uniformly, one must do the following:
> 1. Generate `n` components of the vector using the standard normal distribution
> 2. Find the norm of the generated vector
> 3. Divide the vector by its norm
>
> In simple Python code:
> ```python
> import numpy as np
>
> def sphere_uniform(n: int):
>     vec = np.random.normal(size=n)
>     norm = np.sqrt(np.sum(vec ** 2))
>     return vec / norm
> ```
>
> This implementation suggestion is based on [this blog post](https://towardsdatascience.com/the-best-way-to-pick-a-unit-vector-7bd0cc54f9b). I'm not sure if there's a faster way of implementing it using C++ and Cython, but I'm confident that someone here will know.

I would probably use np.linalg.norm() to calculate the norm, but mostly for readability.

> Why Implement It in Numpy?
> ===
>
> I believe that random unit vectors are common enough to be a part of Numpy. Scientists, gamers and engineers use random directions all the time, and I believe there is no reason why Numpy won't provide this ability.

As a general policy, we do not plan on adding more methods to Generator (c.f. discussion of a previous suggested addition[1]). This would be a good multivariate distribution to add to scipy.stats, though.
[1] https://www.mail-archive.com/numpy-discussion@python.org/msg04724.html

-- Robert Kern
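For what it's worth, the np.linalg.norm() variant of the poster's function might look like this (a sketch; the optional `rng` parameter is an addition for reproducibility, not part of the original proposal):

```python
import numpy as np

def sphere_uniform(n, rng=None):
    # Gaussian components give a direction uniform on the unit sphere.
    rng = np.random.default_rng() if rng is None else rng
    vec = rng.standard_normal(n)
    return vec / np.linalg.norm(vec)

v = sphere_uniform(3, np.random.default_rng(0))
```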
[Numpy-discussion] Re: Discussion: History & Possible Deprecation/Removal of numpy.linalg.{eig(), eigvals()}?
On Tue, May 3, 2022 at 8:40 PM Leo Fang wrote:
> Hi, I am catching up with an assigned task that I slipped. I am wondering a few things about the numpy.linalg.eig() API (eigvals() is also included in this discussion, but for simplicity let's focus on eig()). AFAIU the purpose of eig() is for handling general eigen-systems which have left and right eigenvectors (that could differ beyond transpose/conjugate). For this, I have two questions:
>
> 1. What's the history of NumPy adding this eig() routine? My guess would be NumPy added it because Matlab had it, but I couldn't tell from a quick commit search.

It essentially dates back to Numeric, though the name was a bit different. But the core feature of only computing right-eigenvectors was there. scipy.linalg.eig() came along and exposed both left- and right-eigenvectors (with close to the present signature, I think). Then numpy took over from Numeric and adopted the scipy.linalg names for the routines, but cut out a number of the features that made some of the interfaces complicated. Since Numeric only did right-eigenvectors (what most people want, even in the general-matrix case), so did numpy.

> 2. Would it be possible to deprecate/remove this API? From past discussions with users and developers, I found:
> - numpy.linalg.eig() only returns the right eigenvectors, so it's not very useful:
>   * When the left eigenvectors are not needed --- which is the majority of the use cases --- eigh() is the right routine, but most users (especially those transitioning from Matlab) don't know it's there

eigh() is only for symmetric/Hermitian matrices.

>   * When the left eigenvectors are needed, users have to use scipy.linalg.eig() anyway

It seems to me that you are assuming that in the non-Hermitian case, the right-eigenvectors are useless without the left-eigenvectors. It isn't obvious to me why that would be the case.
It is true that the left- and right-eigenvectors differ from each other, but it's not clear to me that in every application with a general matrix I need to compute both.

> - A general support for eig() could be challenging (numerical issues), and doing SVDs instead is usually the better way out (in terms of both numerical stability and mathematical sense).

Computing the left-eigenvectors is no real problem. The underlying _GEEV LAPACK routine can do it if asked. It's the same routine under scipy.linalg.eig().

-- Robert Kern
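For example, scipy.linalg.eig() can return the left eigenvectors on request via its `left` flag (a small illustrative check on a made-up matrix):

```python
import numpy as np
from scipy.linalg import eig

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])

# left=True asks the underlying *GEEV routine for the left eigenvectors too.
w, vl, vr = eig(A, left=True, right=True)

# Right eigenvectors: A @ vr[:, i] == w[i] * vr[:, i]
right_ok = np.allclose(A @ vr, vr * w)

# Left eigenvectors: vl[:, i].conj().T @ A == w[i] * vl[:, i].conj().T
left_ok = np.allclose(vl.conj().T @ A, w[:, None] * vl.conj().T)
```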
[Numpy-discussion] Re: Numpy array
On Thu, Jan 27, 2022 at 8:18 AM wrote:
> Hi, I am new to numpy. This is the first time I am using numpy.
>
> https://github.com/mspieg/dynamical-systems/blob/master/Bifurcations.ipynb
>
> This code I found to create a bifurcation graph. There is a section of code which I am not able to understand:
>
> vr[stable], vx[stable]
>
> vr and vx are arrays of 150 elements. The unstable and stable arrays also have 150 elements.
>
> But vr[stable], vx[stable] becomes 75 elements in the array, and I am not able to understand how 150 elements in the array drops to 75.

`stable` is a boolean array which is True where `f_x(vr, vx) < 0` and False otherwise. When a boolean array is used as an index, the resulting array only has the elements where the boolean value in the index array is True, dropping all of the elements where it is False. That is why the number of elements changes.

https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing

-- Robert Kern
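A small demonstration of boolean-mask indexing (illustrative values, not the notebook's actual data):

```python
import numpy as np

vr = np.arange(6)
stable = np.array([True, False, True, True, False, False])

# Boolean indexing keeps only the positions where the mask is True,
# so the result is shorter than the input.
kept = vr[stable]   # -> array([0, 2, 3])
```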
[Numpy-discussion] Re: Fit 2D points to closed curve
On Thu, Jan 20, 2022 at 1:15 PM wrote: > Hi, > I use a CAD package called Rhino which lets users write python scripts > that run in the cad environment. I have a closed curve that is a design > surface border and I have a sparse data set of 2D surveyed points taken > around that border, so some error is included in the 2D data set. I would > like to know what python function I could use to do a 2D best fit of the > points to the curve. > Plenty of examples of fitting curves to points but not of best fitting > points to a closed curve of which I do not know the formula. > You can use a parametric spline (where X and Y are each functions of a "time" coordinate `t`) with periodic boundary conditions. See this StackOverflow answer for an example using scipy.interpolate: https://stackoverflow.com/a/31466013/163633 Whether that spline representation can be converted to Rhino NURBS, or whether Rhino itself can fit a periodic parametric NURBS directly, I don't know. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
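A sketch of the periodic parametric spline fit suggested above, in the style of the linked StackOverflow answer. The noisy circle is made-up stand-in data for the surveyed border points, and the smoothing factor `s` is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.interpolate import splprep, splev

# Hypothetical survey data: noisy 2D points around a closed curve (a circle).
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
rng = np.random.default_rng(0)
x = np.cos(theta) + rng.normal(scale=0.01, size=theta.size)
y = np.sin(theta) + rng.normal(scale=0.01, size=theta.size)

# Close the loop by repeating the first point, then fit with periodic
# boundary conditions (per=1); s > 0 allows smoothing of the survey error.
x = np.r_[x, x[0]]
y = np.r_[y, y[0]]
tck, u = splprep([x, y], s=0.01, per=1)

# Evaluate the fitted closed curve on a fine parameter grid.
xi, yi = splev(np.linspace(0, 1, 200), tck)
# Because the boundary conditions are periodic, the curve closes on itself.
print(np.allclose([xi[0], yi[0]], [xi[-1], yi[-1]]))
```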
[Numpy-discussion] Re: np.convolve question
On Thu, Jan 13, 2022 at 10:47 AM wrote: > Hello, > > I am a mechanical engineer and am using Numpy Convolve to smoothen 1D data > collected from physical testing for a diesel engine. (I found convolve the > best after comparing with other smoothing methodologies) > I tried reading the numpy manual and the wikipedia article referenced in > it. > But I would specifically like to ask/discuss what would happen if the > second 1D input is not given? The second input is required, which is evident from the docstring and could be ascertained by just trying it. Are you instead asking whether np.convolve() could be modified to not require the second argument? If so, no, I don't think we would. There are no sensible defaults. Convolution is inherently a very general operation that combines two inputs to achieve a variety of effects, not just smoothing. > what type of window will the function pass the data through to smoothen it? There are several ways to build a low-pass (i.e. smoothing) filter. My favorite first choice is the Savitzky-Golay filter, as mentioned several times in the StackOverflow answers that you linked to. It constructs a low-pass filter window (i.e. the second 1D input) according to some simple, understandable design rules and then uses convolution. I recommend starting with `scipy.signal.savgol_filter()`. If you need to reuse the same filter window for lots of signals, you can use `scipy.signal.savgol_coeffs()` to get that window first, and then use `np.convolve()` explicitly. > Here are the articles I have read to implement my smoothing code: > References: > 1. https://numpy.org/doc/stable/reference/generated/numpy.convolve.html > 2. https://en.wikipedia.org/wiki/Convolution > 3. 
> https://stackoverflow.com/questions/20618804/how-to-smooth-a-curve-in-the-right-way -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
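A short sketch of both routes mentioned above — the one-shot `savgol_filter()` call, and building the window once with `savgol_coeffs()` and convolving explicitly. The signal here is synthetic stand-in data, and the window length and polynomial order are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

# Noisy 1D signal standing in for the engine test data.
rng = np.random.default_rng(42)
t = np.linspace(0, 1, 200)
noisy = np.sin(2 * np.pi * 3 * t) + rng.normal(scale=0.1, size=t.size)

# One-shot smoothing: 21-sample window, local cubic fits.
smooth = savgol_filter(noisy, window_length=21, polyorder=3)

# Reusable route: build the filter window once, then convolve explicitly.
window = savgol_coeffs(21, 3)
smooth2 = np.convolve(noisy, window, mode='same')

# Away from the edges (where the boundary handling differs) the two agree.
assert np.allclose(smooth[20:-20], smooth2[20:-20])
```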
[Numpy-discussion] Re: representation of valid float type range
On Thu, Jan 6, 2022 at 4:43 PM wrote: > Thanks for your answer. > > I think i understand it - is it that `f64_info.max - f64_info.min` does > not fit in float64? because it approximates `2 * f64_info.max`? > Well, it "fits" into a float64 by becoming `inf`, which is a valid float64 value, just one that has particular consequences. In that case, I agree with Klaus, linspace should be able to handle this? > I don't particularly agree that linspace() ought to add special-case logic to handle this. It would be difficult to write logic that reliably recognizes all the ways that something like this is actually the case and then does something sensible to recover from it. Lev showed how you can manually do something sorta reasonable for `np.linspace(f64_info.min, f64_info.max, num)` because the endpoints are symmetric around 0, but there is nothing that can really be done for the asymmetric case. Floating point arithmetic has its limits, and playing with values near the boundaries is going to make you run into those limits. I would rather have linspace() do something consistent that gives non-useful results in these boundary edge cases than try to do a bunch of different logic in these outer limits. The utility of getting "sensible" results in these extreme cases is fairly limited (not least because any downstream computation will also be very likely to generate NaNs and infs), so I would happily trade that to have a more consistent mental model about what linspace() does. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
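The overflow under discussion is easy to demonstrate directly — the span of the float64 range itself does not fit in float64:

```python
import numpy as np

f64_info = np.finfo(np.float64)

# max - min is approximately 2 * max, which overflows float64 to inf;
# that inf then propagates through any step computation built on it.
span = f64_info.max - f64_info.min
print(span)               # inf
print(np.isinf(span))     # True
```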
[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations
On Thu, Nov 25, 2021 at 11:58 PM Qianqian Fang wrote: > from my limited experience, as long as the "TypeError" from json keeps > popping up, requests like this ( #12481, #16432, #18994, > pallets/flask#4012, openmm/openmm#3202, zarr-developers/zarr-python#354) > will unlikely to cease (and maintainers will have to keep on closing with > "wontfix") - after all, no matter how different the format expectations a > user may have, seeing some sort of default behavior is still a lot more > satisfying than seeing an error. > I agree. Unfortunately, we have no control over that inside of numpy. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Proposal - Making ndarray object JSON serializable via standardized JData annotations
On Thu, Nov 25, 2021 at 10:21 PM Qianqian Fang wrote: > On 11/25/21 17:05, Stephan Hoyer wrote: > > Hi Qianqian, > > What is your concrete proposal for NumPy here? > > Are you suggesting new methods or functions like to_json/from_json in > NumPy itself? > > > that would work - either define a subclass of JSONEncoder to serialize > ndarray and allow users to pass it to cls in json.dump, or, as you > mentioned, define to_json/from_json like pandas DataFrame > <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html> > would save people from writing customized codes/formats. > > I am also wondering if there is a more automated way to tell > json.dump/dumps to use a default serializer for ndarray without using > cls=...? I saw a SO post mentioned about a method called "__serialize__" in > a class, but can't find it in the official doc. I am wondering if anyone is > aware of the method defining a default json serializer in an object? > There isn't one. You have to explicitly provide the JSONEncoder. Which is why there is nothing that we can really do in numpy to avoid the TypeError that you mention below. The stdlib json module just doesn't give us the hooks to be able to do that. We can provide top-level functions like to_json()/from_json() to encode/decode a top-level ndarray to a JSON text, but that doesn't help with ndarrays in dicts or other objects. We could also provide a JSONEncoder/JSONDecoder pair, too, but as I mention in one of the Github issues you link to, there are a number of different expectations that people could have for what the JSON representation of an array is. Some will want to use the JData standard. Others might just want the arrays to be represented as lists of lists of plain-old JSON numbers in order to talk with software in other languages that have no particular standard for array data. > As far as I can tell, reading/writing in your custom JSON format already > works with your jdata library. 
> > ideally, I was hoping the small jdata encoder/decoder functions can be > integrated into numpy; it can help avoid the "TypeError: Object of type > ndarray is not JSON serializable" in json.dump/dumps without needing > additional modules; more importantly, it simplifies users experience in > exchanging complex arrays (complex valued, sparse, special shapes) with > other programming environments. > It seems to me that the jdata package is the right place for implementing the JData standard. I'm happy for our documentation to point to it in all the places that we talk about serialization of arrays. If the json module did have some way for us to specify a default representation for our objects, then that would be a different matter. But for the present circumstances, I'm not seeing a substantial benefit to moving this code inside of numpy. Outside of numpy, you can evolve the JData standard at its own pace. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
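For readers who just want the "lists of plain JSON numbers" behavior today, here is a minimal sketch of the kind of `JSONEncoder` subclass discussed above (the class name is made up for illustration):

```python
import json
import numpy as np

class NDArrayEncoder(json.JSONEncoder):
    """Hypothetical encoder: render ndarrays as nested plain lists."""
    def default(self, o):
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)

payload = {"weights": np.array([[1.0, 2.0], [3.0, 4.0]])}
text = json.dumps(payload, cls=NDArrayEncoder)
print(text)  # {"weights": [[1.0, 2.0], [3.0, 4.0]]}
```

Note that `cls=NDArrayEncoder` must be passed on every `json.dumps()` call — which is exactly the limitation described above: the stdlib json module offers no hook for numpy to register a default representation globally.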
[Numpy-discussion] Re: dtype=(bool) vs dtype=bool
On Tue, Oct 19, 2021 at 8:54 PM Hongyi Zhao wrote: > On Wed, Oct 20, 2021 at 8:29 AM Robert Kern wrote: > > > > On Tue, Oct 19, 2021 at 8:22 PM wrote: > >> > >> Do I have to use it this way? > > > > > > Nothing is forcing you to, but everyone else will write it as > `dtype=bool`, not `dtype=(bool)`. `dtype=(bool)` is perfectly > syntactically-valid Python. It's just not idiomatic, so readers of your > code will wonder why you wrote it that way and if you meant something else > and will have trouble reading your code. > > I use Emacs, and find that python-lsp-server will give the dtype=() as > the completion result, see here [1] for more detailed discussion. > > [1] https://github.com/python-lsp/python-lsp-server/issues/98 Well, that's not helpful of python-lsp-server. Is it looking at our type annotations? The annotations for `dtype=` arguments are likely quite complicated because we accept so many different types here, but `()` is not a great completion. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: dtype=(bool) vs dtype=bool
On Tue, Oct 19, 2021 at 8:22 PM wrote: > Do I have to use it this way? > Nothing is forcing you to, but everyone else will write it as `dtype=bool`, not `dtype=(bool)`. `dtype=(bool)` is perfectly syntactically-valid Python. It's just not idiomatic, so readers of your code will wonder why you wrote it that way and if you meant something else and will have trouble reading your code. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 5:17 AM Hongyi Zhao wrote: > > > That’s just the way Python’s syntax works. Operators are not names that > can be resolved to objects that can be compared with the `is` operator. > Instead, when that operator is evaluated in an expression, the Python > interpreter will look up a specially-named method on the operand object (in > this case `__invert__`). Numpy array objects implement this method using > `np.invert`. > > If so, which is symlink to which, I mean, which is the original name, > and which is an alias? > "symlink" and "alias" are probably not the best analogies. The implementation of `np.ndarray.__invert__` simply calls `np.invert` to do the actual computation. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 1:09 AM wrote: > Thank you for pointing this out. This is the code block which includes the > first appearance of the keyword `logical_not`. > > BTW, why can't the ~ operator be tested equal to 'np.invert', as shown > below: > > ``` > In [1]: import numpy as np > In [3]: np.invert is np.bitwise_not > Out[3]: True > > In [4]: np.invert is ~ > File "", line 1 > np.invert is ~ > ^ > SyntaxError: invalid syntax > ``` > That’s just the way Python’s syntax works. Operators are not names that can be resolved to objects that can be compared with the `is` operator. Instead, when that operator is evaluated in an expression, the Python interpreter will look up a specially-named method on the operand object (in this case `__invert__`). Numpy array objects implement this method using `np.invert`. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
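A small demonstration of the dispatch described above — `~a` cannot be compared to `np.invert` by name, but its result is exactly what `a.__invert__()` (and hence `np.invert(a)`) produces:

```python
import numpy as np

a = np.array([0, 1, 2], dtype=np.uint8)

# `~a` is not a name lookup; Python evaluates it by calling a.__invert__(),
# which numpy arrays implement in terms of np.invert.
assert np.array_equal(~a, np.invert(a))
assert np.array_equal(~a, a.__invert__())
print(~a)  # [255 254 253]
```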
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Mon, Oct 4, 2021 at 12:09 AM wrote: > > (the bool implementation uses the `logical_not` loop). > > Do you the following code snippet: > > > https://github.com/numpy/numpy/blob/3c1e9b4717b2eb33a2bf2d495006bc300f5b8765/numpy/core/src/umath/loops.c.src#L1627-L1633 This is the one that gets expaneded to `BOOL_logical_not`: https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops.c.src#L493-L510 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: The source code corresponding to numpy.invert.
On Sun, Oct 3, 2021 at 9:27 PM wrote: > > So, C/Python operator `~` has been overridden by the corresponding user > function in numpy, but where is the corresponding source code > implementation? > ufuncs are implemented in C. We provide so-called loop functions that iterate over contiguous segments of the operand arrays. We use a custom code generator to make implementations for all of the types. Here are the ones that correspond to `np.invert`. (the bool implementation uses the `logical_not` loop). https://github.com/numpy/numpy/blob/main/numpy/core/src/umath/loops.c.src#L612-L626 -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: spam on the mailing lists
On Wed, Sep 29, 2021 at 3:35 AM Andras Deak wrote: > Hi All, > > Today both of the python.org mailing lists I'm subscribed to (numpy and > scipy-dev) got the same kind of link shortener spam. I assume all the > mailing lists started getting these, and that these won't go away for a > while. > Given that we've had a literal order of magnitude more messages about the spam than the spam itself, maybe it's just a blip? I will suggest that spam management is probably not a strong, much less a decisive, argument for migrating to a new discussion forum. -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
[Numpy-discussion] Re: Running numpy.test() after pip install
On Wed, Sep 29, 2021 at 3:50 PM Ralf Gommers wrote: > > On Tue, Sep 28, 2021 at 10:39 PM Aaron Meurer wrote: > >> Could numpy include a variant (I'm not sure what setuptools calls >> this) so that 'pip install numpy[tests]' installs those extra >> dependencies? >> > > I'd prefer not to, those things aren't great for maintenance/testing, and > if you have to read the docs to know how to spell `numpy[tests]` you may > just as well write it `pip install numpy pytest hypothesis`. There's also > no one set of extra dependencies people will want - what about running > benchmarks, optional test dependencies, etc.? > I think Bennet's request to add this to the `long_description` that shows up on PyPI is eminently reasonable, though. >> If adding them as dependencies seems too heavyweight, or is otherwise >> deemed undesirable, perhaps just a note in the Project Description at >> https://pypi.org/project/numpy/ to say that, if you want to run tests, >> those two packages will be needed? >> >> Thanks, -- bennet > -- Robert Kern ___ NumPy-Discussion mailing list -- numpy-discussion@python.org To unsubscribe send an email to numpy-discussion-le...@python.org https://mail.python.org/mailman3/lists/numpy-discussion.python.org/ Member address: arch...@mail-archive.com
Re: [Numpy-discussion] SeedSequence.spawn()
On Sun, Aug 29, 2021 at 6:58 AM Stig Korsnes wrote: > Thanks again Robert! > Got rid of dict(state). > > Not sure I followed you completely on the test case. > In the code that you showed, you were pulling out and storing the `.state` dict and then punching that back into a single `Generator` instance. Instead, you can just make the ~200-1000 `Generator` instances. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
On Sat, Aug 28, 2021 at 5:56 AM Stig Korsnes wrote: > Thank you again Robert. > I am using NamedTuple for mye keys, which also are keys in a dictionary. > Each key will be unique (tuple on distinct int and enum), so I am thinking > maybe the risk of producing duplicate hash is not present, but could as > always be wrong :) > Present, but possibly ignorably small. 128-bit spaces give enough breathing room for me to be comfortable; 64-bit spaces like what hash() will use for its results makes me just a little claustrophobic. If the structure of the keys is pretty fixed, just these two integers (counting the enum as an integer), then I might just use both in the seeding material. def get_key_seed(key:ComponentId, root_seed:int): return np.random.SeedSequence([key.the_int, int(key.the_enum), root_seed]) > For positive ints i followed this tip > https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function > , and did: > > def stronghash(key:ComponentId): > return ctypes.c_size_t(hash(key)).value > np.uint64(possibly_negative_integer) will also work for this purpose (somewhat more reliably). Since I will be using each process/random sample several times, and keeping > all of them in memory at once is not feasible (dimensionality) i did the > following: > > self._rng = default_rng(cs) > self._state = dict(self._rng.bit_generator.state) # > > def scenarios(self) -> npt.NDArray[np.float64]: > self._rng.bit_generator.state = self._state > > return > > Would you consider this bad practice, or an ok solution? > It's what that property is there for. No need to copy; `.state` creates a new dict each time. In a quick test, I measured a process with 1 million Generator instances to use ~1.5 GiB while 1 million state dicts ~1.0 GiB (including all of the other overhead of Python and numpy; not a scientific test). Storing just the BitGenerator is half-way in between. That's something, but not a huge win. 
If that is really crossing the border from feasible to infeasible, you may be about to run into your limits anyways for other reasons. So balance that out with the complications of swapping state in and out of a single instance. > In Norway we have a saying which directly translates: "He asked for the > finger... and took the whole arm." > Well, when I craft an overly-complicated system, I feel responsible to help shepherd people along in using it well. :-) -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
joblib is a library that uses clever caching of function call results to make the development of certain kinds of data-heavy computational pipelines easier. In order to derive the key to be used to check the cache, joblib has to look at the arguments passed to the function, which may involve usually-nonhashable things like large numpy arrays. https://joblib.readthedocs.io/en/latest/ So they constructed joblib.hash() which basically takes the arguments, pickles them into a bytestring (with some implementation details), then computes an MD5 hash on that. It's probably overkill for your keys, but it's easily available and quite generic. It returns a hex-encoded string of the 128-bit MD5 hash. `int(..., 16)` will convert that to a non-negative (almost-certainly positive!) integer that can be fed into SeedSequence. On Fri, Aug 27, 2021 at 5:03 AM Stig Korsnes wrote: > Thank you Robert! > This scheme fits perfectly into what I`m trying to accomplish! :) The > "smooshing" of ints by supplying a list of ints had eluded me. Thank you > also for the pointer about built-in hash(). I would not be able to rely on > it anyways, because it does not return strictly positive ints which > SeedSequence requires. If you have a minute to spare: Could you briefly > explain "int(joblib.hash(key) > <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, > 16)" , and would this always return non-negative integers? > Thanks again! > > tor. 26. aug. 2021 kl. 22:59 skrev Robert Kern : > >> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes >> wrote: >> >>> Hi, >>> Is there a way to uniquely spawn child seeds? >>> I`m doing monte carlo analysis, where I have n random processes, each >>> with their own generator. >>> All process models instantiate a generator with default_rng(). I.e >>> ss=SeedSequence() cs=ss.Spawn(n), and using cs[i] for process i. 
Now, the >>> problem I`m facing, is that results using individual process depends on >>> the order of the process initialization ,and the number of processes used. >>> However, if I could spawn children with a unique identifier, I would be >>> able to reproduce my individual results without having to pickle/log >>> states. For example, all my models have an id (tuple) field which is >>> hashable. >>> If I had the ability to SeedSequence(x).Spawn([objects]) where objects >>> support hash(object), I would have reproducibility for all my processes. I >>> could do without the spawning, but then I would probably loose independence >>> when I do multiproc? Is there a way to achieve my goal in the current >>> version 1.21 of numpy? >>> >> >> I would probably not rely on `hash()` as it is only intended to be pretty >> good at getting distinct values from distinct inputs. If you can combine >> the tuple objects into a string of bytes in a reliable, collision-free way >> and use one of the cryptographic hashes to get them down to a 128bit >> number, that'd be ideal. `int(joblib.hash(key) >> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, >> 16)` should do nicely. You can combine that with your main process's seed >> easily. SeedSequence can take arbitrary amounts of integer data and smoosh >> them all together. The spawning functionality builds off of that, but you >> can also just manually pass in lists of integers. >> >> Let's call that function `stronghash()`. Let's call your main process >> seed number `seed` (this is the thing that the user can set on the >> command-line or something you get from `secrets.randbits(128)` if you need >> a fresh one). Let's call the unique tuple `key`. 
You can build the >> `SeedSequence` for each job according to the `key` like so: >> >> root_ss = SeedSequence(seed) >> for key, data in jobs: >> child_ss = SeedSequence([stronghash(key), seed]) >> submit_job(key, data, seed=child_ss) >> >> Now each job will get its own unique stream regardless of the order the >> job is assigned. When the user reruns it with the same root `seed`, they >> will get the same results. When the user chooses a different `seed`, they >> will get another set of results (this is why you don't want to just use >> `SeedSequence(stronghash(key))` all by itself). >> >> I put the job-specific seed data ahead of the main program's seed to be >> on the super-safe side. The spawning mechanism will append integers to the >> end, so there's a super-tiny chance somewhere down a long line of >> `root_ss.spawn()`s that there would be a collision (and I mean >> super-extra-tiny). But best practices cost nothing. >> >> I hope that helps and is not too confusing! >> >> -- >> Robert Kern >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@python.org >> https://mail.python.org/mailman/listinfo/numpy-discussion >> > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] SeedSequence.spawn()
On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes wrote: > Hi, > Is there a way to uniquely spawn child seeds? > I`m doing monte carlo analysis, where I have n random processes, each with > their own generator. > All process models instantiate a generator with default_rng(). I.e > ss=SeedSequence() cs=ss.Spawn(n), and using cs[i] for process i. Now, the > problem I`m facing, is that results using individual process depends on > the order of the process initialization ,and the number of processes used. > However, if I could spawn children with a unique identifier, I would be > able to reproduce my individual results without having to pickle/log > states. For example, all my models have an id (tuple) field which is > hashable. > If I had the ability to SeedSequence(x).Spawn([objects]) where objects > support hash(object), I would have reproducibility for all my processes. I > could do without the spawning, but then I would probably loose independence > when I do multiproc? Is there a way to achieve my goal in the current > version 1.21 of numpy? > I would probably not rely on `hash()` as it is only intended to be pretty good at getting distinct values from distinct inputs. If you can combine the tuple objects into a string of bytes in a reliable, collision-free way and use one of the cryptographic hashes to get them down to a 128bit number, that'd be ideal. `int(joblib.hash(key) <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>, 16)` should do nicely. You can combine that with your main process's seed easily. SeedSequence can take arbitrary amounts of integer data and smoosh them all together. The spawning functionality builds off of that, but you can also just manually pass in lists of integers. Let's call that function `stronghash()`. Let's call your main process seed number `seed` (this is the thing that the user can set on the command-line or something you get from `secrets.randbits(128)` if you need a fresh one). Let's call the unique tuple `key`. 
You can build the `SeedSequence` for each job according to the `key` like so: root_ss = SeedSequence(seed) for key, data in jobs: child_ss = SeedSequence([stronghash(key), seed]) submit_job(key, data, seed=child_ss) Now each job will get its own unique stream regardless of the order the job is assigned. When the user reruns it with the same root `seed`, they will get the same results. When the user chooses a different `seed`, they will get another set of results (this is why you don't want to just use `SeedSequence(stronghash(key))` all by itself). I put the job-specific seed data ahead of the main program's seed to be on the super-safe side. The spawning mechanism will append integers to the end, so there's a super-tiny chance somewhere down a long line of `root_ss.spawn()`s that there would be a collision (and I mean super-extra-tiny). But best practices cost nothing. I hope that helps and is not too confusing! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
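The scheme above can be made concrete with a small runnable sketch. Here `stronghash()` is an illustrative stand-in for `joblib.hash()` built from the stdlib `hashlib` (it assumes `repr()` of the key is stable across runs, which holds for tuples of ints and strings); the seed and key values are made up:

```python
import hashlib
import numpy as np
from numpy.random import SeedSequence, default_rng

def stronghash(key):
    # Illustrative stand-in for joblib.hash(): digest the repr of the key
    # down to a non-negative 128-bit integer suitable for SeedSequence.
    return int(hashlib.md5(repr(key).encode("utf-8")).hexdigest(), 16)

seed = 12345           # the main process seed
key = (7, "pump")      # a unique, hashable model identifier

# Job-specific seed data first, then the main seed, as described above.
child_ss = SeedSequence([stronghash(key), seed])
rng = default_rng(child_ss)

# The same (key, seed) pair reproduces the same stream, regardless of
# the order in which jobs were spawned.
rng2 = default_rng(SeedSequence([stronghash(key), seed]))
assert np.array_equal(rng.standard_normal(3), rng2.standard_normal(3))
```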
Re: [Numpy-discussion] sinpi/cospi trigonometric functions
On Wed, Jul 14, 2021 at 12:26 PM Joshua Wilson wrote: > I'll note that SciPy actually does have `sinpi` and `cospi`-they just > happen to be private: > > https://github.com/scipy/scipy/blob/master/scipy/special/functions.json#L58 > https://github.com/scipy/scipy/blob/master/scipy/special/functions.json#L12 > > They are used extensively inside the module though as helpers in other > special functions and have extensive tests of their own: > > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L533 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L547 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L960 > > https://github.com/scipy/scipy/blob/master/scipy/special/tests/test_mpmath.py#L1741 > > I have no objections to making them public; the PR is as simple as > removing the underscores and adding a docstring. > Delightful! -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] sinpi/cospi trigonometric functions
On Wed, Jul 14, 2021 at 11:39 AM Tom Programming wrote: > Hi all, > > (I am very new to this mail list so please cut me some slack) > > trigonometric functions like sin(x) are usually implemented as: > > 1. Some very complicated function that does bit twiddling and basically > computes the reminder of x by pi/2. An example in > http://www.netlib.org/fdlibm/e_rem_pio2.c (that calls > http://www.netlib.org/fdlibm/k_rem_pio2.c ). i.e. ~500 lines of branching > C code. The complexity arises in part because for big values of x the > subtraction becomes more and more ill defined, due to x being represented > in binary base to which an irrational number has to subtracted and > consecutive floating point values being more and more apart for higher > absolute values. > 2. A Taylor series for the small values of x, > 3. Plus some manipulation to get the correct branch, deal with subnormal > numbers, deal with -0, etc... > > If we used a function like sinpi(x) = sin(pi*x) part (1) can be greatly > simplified, since it becomes trivial to separate the reminder of the > division by pi/2. There are gains both in the accuracy and the performance. > > In large parts of the code anyways there is a pi inside the argument of > sin since it is common to compute something like sin(2*pi*f*t) etc. So I > wonder if it is feasible to implement those functions in numpy. > > To strengthen my argument I'll note that the IEEE standard, too, defines ( > https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf ) the functions > sinPi, cosPi, tanPi, atanPi, atan2Pi. 
And there are existing > implementations, for example, in Julia ( > https://github.com/JuliaLang/julia/blob/6aaedecc447e3d8226d5027fb13d0c3cbfbfea2a/base/special/trig.jl#L741-L745 > ) and the Boost C++ Math library ( > https://www.boost.org/doc/libs/1_54_0/boost/math/special_functions/sin_pi.hpp > ) > > And that issue caused by apparently inexact calculations have been raised > in the past in various forums ( > https://stackoverflow.com/questions/20903384/numpy-sinpi-returns-negative-value > https://stackoverflow.com/questions/51425732/how-to-make-sinpi-and-cospi-2-zero > https://www.reddit.com/r/Python/comments/2g99wa/why_does_python_not_make_sinpi_0_just_really/ > ... ) > > PS: to be nitpicky I see that most implementation implement sinpi as > sin(pi*x) for small values of x, i.e. they multiply x by pi and then use > the same coefficients for the Taylor series as the canonical sin. A > multiply instruction could be spared, in my opinion, by storing different > Taylor expansion number coefficients tailored for the sinpi function. It is > not clear to me if it is not done because the performance gain is small, > because I am wrong about something, or because those 6 constants of the > Taylor expansion have a "sacred aura" about them and nobody wants to enter > deeply into this. > The main value of the sinpi(x) formulation is that you can do the reduction on x more accurately than on pi*x (reduce-then-multiply rather than multiply-then-reduce) for people who particularly care about the special locations of half-integer x. sin() and cos() are often not implemented in software, but by CPU instructions, so you don't want to reimplement them. There is likely not a large accuracy win by removing the final multiplication. We do have sindg(), cosdg(), and tandg() in scipy.special that work similarly for inputs in degrees rather than radians. They also follow the reduce-then-multiply strategy. scipy.special would be a good place for sinpi() and friends. 
-- Robert Kern
Re: [Numpy-discussion] Interface for wrapping random distribution functions (__array_function__, __ufunc__, ?)
On Wed, Jun 30, 2021 at 3:34 PM Yoann Mocquin wrote: > > > Hello there, > > here is a feature request for the possibility for numpy random > distributions to be wrapped by a mechanism like __array_function__ or > __array_ufunc__. You can find the GH issue at: > https://github.com/numpy/numpy/issues/19382. > > > > This post follows [this one](https://github.com/numpy/numpy/issues/18902). > I figured I should open another issue for `np.random.normal`, but this > applies for all distributions I guess. > > ## Feature > > Basically, I would like `numpy.random.*` distributions to trigger > an interface when passed instances of custom classes, "à la" > `__array_ufunc__` or `__array_function__`. Here is an example concept : > The main problem is that these are not functions but methods on a hidden global `RandomState` instance. I haven't kept up fully with all of the efforts in the `__array_*__` proposals, but I do see that methods are explicitly excluded from the `__array_function__` mechanism: https://numpy.org/neps/nep-0018-array-function-protocol.html#non-goals The `RandomState` functionality (and thus all of these aliases in `numpy.random`) is now frozen (per NEP 19), so we will not be adding this functionality to them. If the `__array_function__` developments eventually work out a good way to wrap methods, then we can think about using that on `Generator`, but I suspect that will not be straightforward. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
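The exclusion of methods can be demonstrated with a minimal wrapper class (the class here is hypothetical, for illustration): free functions covered by NEP 18 dispatch to `__array_function__`, while `np.random.normal`, being a bound method on the hidden `RandomState`, never sees the protocol and just coerces its argument through `__array__`.

```python
import numpy as np

class MyArray:
    """Hypothetical duck array implementing the NEP 18 protocol."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # intercept NEP 18-dispatched numpy functions
        return f"dispatched {func.__name__}"

    def __array__(self, dtype=None, copy=None):
        # fallback coercion path used by non-dispatched code
        return np.asarray(self.data, dtype=dtype)

# A free function dispatches through the protocol:
print(np.concatenate([MyArray([1.0, 2.0])]))    # -> dispatched concatenate

# np.random.normal is a method on a hidden RandomState: no dispatch;
# the wrapper is silently coerced to a plain ndarray via __array__.
out = np.random.normal(loc=MyArray([0.0, 0.0]))
print(type(out))                                # -> <class 'numpy.ndarray'>
```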
Re: [Numpy-discussion] ENH: Discussions about 'add numpy.topk'
On Sat, May 29, 2021 at 3:35 PM Daniele Nicolodi wrote: > What does k stand for here? As someone that never encountered this > function before I find both names equally confusing. If I understand > what the function is supposed to be doing, I think largest() would be > much more descriptive. > `k` is the number of elements to return. `largest()` can connote that it's only returning the one largest value. It's fairly typical to include a dummy variable (`k` or `n`) in the name to indicate that the function lets you specify how many you want. See, for example, the stdlib `heapq` module's `nlargest()` function. https://docs.python.org/3/library/heapq.html#heapq.nlargest "top-k" comes from the ML community where this function is used to evaluate classification models (`k` instead of `n` being largely an accident of history, I imagine). In many classification problems, the number of classes is very large, and they are very related to each other. For example, ImageNet has a lot of different dog breeds broken out as separate classes. In order to get a more balanced view of the relative performance of the classification models, you often want to check whether the correct class is in the top 5 classes (or whatever `k` is appropriate) that the model predicted for the example, not just the one class that the model says is the most likely. "5 largest" doesn't really work in the sentences that one usually writes when talking about ML classifiers; they are talking about the 5 classes that are associated with the 5 largest values from the predictor, not the values themselves. So "top k" is what gets used in ML discussions, and that transfers over to the name of the function in ML libraries. It is a top-down reflection of the higher level thing that people want to compute (in that context) rather than a bottom-up description of how the function is manipulating the input, if that makes sense. Either one is a valid way to name things. 
There is a lot to be said for numpy's domain-agnostic nature that we should prefer the bottom-up description style of naming. However, we are also in the midst of a diversifying ecosystem of array libraries, largely driven by the ML domain, and adopting some of that terminology when we try to enhance our interoperability with those libraries is also a factor to be considered. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
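For reference, a top-k along the lines discussed can be sketched with `np.argpartition`; the function name and signature here are assumptions for illustration, not a numpy API.

```python
import numpy as np

def topk(a, k):
    # Hypothetical helper: the k largest values of a 1-D array, descending.
    idx = np.argpartition(a, -k)[-k:]   # indices of the k largest, unordered
    return np.sort(a[idx])[::-1]        # order them largest-first

scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
print(topk(scores, 2))                  # -> [0.9 0.7]
```

In the ML evaluation setting described above, you would typically keep the indices (`idx`) rather than the values, to check whether the correct class is among the model's top k predictions.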
Re: [Numpy-discussion] Indexing question
On Thu, May 20, 2021 at 1:40 PM CJ Carey wrote: > Or as a one-liner: > > out[np.arange(len(x)), x] = 1 > Ah, right. `x[arange(len(x))]` is a no-op. > If NEP 21 is accepted ( > https://numpy.org/neps/nep-0021-advanced-indexing.html) this would be > even simpler: > > out.vindex[:, x] = 1 > > Was there ever a decision about that NEP? I didn't follow the discussion > too closely at the time. > IIRC, I think there was broad agreement on the final plan as stated in the NEP. I suspect, though, that the general appetite for adding to the array API surface has declined even from its anemic starting point, now that deep learning frameworks with ndarray-mimicking APIs have taken off. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Indexing question
On Thu, May 20, 2021 at 9:47 AM Neal Becker wrote: > This seems like something that can be done with indexing, but I > haven't found the solution. > > out is a 2D array initialized to zeros. x is a 1D array whose > values correspond to the columns of out. For each row in out, set > out[row, x[row]] = 1. Here is working code: > def orthogonal_mod(x, nbits): > out = np.zeros((len(x), 1 << nbits)) > for e in range(len(x)): > out[e, x[e]] = 1 > return out > > Any idea to do this without an explicit python loop? > i = np.arange(len(x)); j = x[i]; out[i, j] = 1 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
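A runnable sketch of the vectorized version (sample values chosen for illustration): each row index is paired elementwise with the corresponding column index from x, so a single fancy-indexing assignment replaces the loop.

```python
import numpy as np

x = np.array([2, 0, 1, 3])              # column to set in each row
nbits = 2
out = np.zeros((len(x), 1 << nbits))
out[np.arange(len(x)), x] = 1           # row e gets a 1 in column x[e]
print(out)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]
```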
Re: [Numpy-discussion] two questions about `choose`
On Mon, Apr 19, 2021 at 12:24 PM bas van beek wrote: > After taking a closer look at the `np.choose` docs I think we should > change `choices: npt.ArrayLike` into `choices: Sequence[npt.ArrayLike]`. > > This would resolve the issue, but it'd also mean that directly passing an > array will be prohibited (as far as type checkers are concerned). > > The docs do mention that the outermost container should be a list or > tuple anyway, so I'm not convinced that this new typing restriction would > be a huge loss. > Why not allow both? -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] two questions about `choose`
On Sat, Apr 17, 2021 at 4:28 PM Kevin Sheppard wrote: > 1. I suppose it only uses the (Native int or int64) dtype since each one > would need a code path to run quickly. > > 2. I would describe this as a bug. I think sequences are converted to > arrays and in this case the conversion is not returning a 2 element object > array but expanding and then concatenating. > No, it's broadcasting the two to a common shape, as documented. https://numpy.org/doc/stable/reference/generated/numpy.choose.html > E.g., >> a = (0,1,1,0,0,0,1,1) # binary array >> np.choose(a, (0, range(8))) # array([0, 1, 2, 0, 0, 0, 6, 7]) >> > This is equivalent to `np.choose(a, (np.zeros(8, dtype=int), range(8)))` -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
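Kevin's example, made runnable: `choose` broadcasts the scalar `0` against `range(8)` to a common `(8,)` shape rather than treating the pair as a 2-element object array, so the selector picks elementwise between a broadcast zero and `arange(8)`.

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 0, 1, 1])      # selector: which choice array to draw from
result = np.choose(a, (0, np.arange(8)))    # the scalar 0 is broadcast to shape (8,)
print(result)                               # -> [0 1 2 0 0 0 6 7]

# the same thing with the broadcast written out explicitly:
equiv = np.choose(a, (np.zeros(8, dtype=int), np.arange(8)))
print(np.array_equal(result, equiv))        # -> True
```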
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 6:16 PM Andrew Nelson wrote: > On Thu, 15 Apr 2021 at 07:15, Robert Kern wrote: > >> On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke >> wrote: >> >>> Regarding numpy, I'd propose a bolder measure: >>> To let savetxt(fname, X, ...) store exactly the same information in >>> compressed and uncompressed files, always invoke gzip with mtime = 0. >>> >> >> I agree. >> > > I might look into making a PR for this. To be clear what would the desired > functionality be: > > 1. Mandatory to have mtime = 0? > > 2. Default mtime = 0, but `np.save*` has an extra `mtime` kwd that allows > to set the mtime? > > 3. Default mtime = time.time(), but `np.save*` has an extra `mtime` kwd > that allows to set the mtime = 0? > > > As Joachim says for testing/git-related purposes it is nice to have > bit-wise unchanged files produced (such that the file-hash is unchanged), > but I can also see that it might be nice to have a modification time when > files contained in a zip file were last changed (e.g. write with numpy, > check/inspect with a different module). Of course with the latter you could > just look at the overall file-write date, they should be the same. > I suspect no one's actually looking at the timestamp inside the file (relevant XKCD comic[1] notwithstanding). I'd lean towards boldness here and just set mtime=0. If someone screams, then we can add in the option for them to get the old functionality. But moving closer to reliably reproducible file formats is a goal worth changing the default behavior. [1] https://xkcd.com/1172/ -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
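A sketch of what the mtime=0 behavior would look like from user code; the helper name and keyword are hypothetical, not the numpy API. Opening the gzip stream yourself with `mtime=0` (and an empty embedded filename) and handing the text wrapper to `savetxt` makes repeated runs produce byte-identical files.

```python
import gzip
import io
import numpy as np

def savetxt_gz(fname, X, mtime=0):
    # Hypothetical helper: gzip with a fixed mtime (and no embedded
    # filename) so identical data yields identical bytes on disk.
    with open(fname, "wb") as raw:
        gz = gzip.GzipFile(fileobj=raw, mode="wb", filename="", mtime=mtime)
        with io.TextIOWrapper(gz, encoding="utf-8", newline="\n") as txt:
            np.savetxt(txt, X)
```

Two calls with the same data then hash identically, which is exactly what file-hash-based workflow systems like DVC need.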
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 6:20 PM Derek Homeier < de...@astro.physik.uni-goettingen.de> wrote: > On 14 Apr 2021, at 11:15 pm, Robert Kern wrote: > > > > On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke > wrote: > > Regarding numpy, I'd propose a bolder measure: > > To let savetxt(fname, X, ...) store exactly the same information in > > compressed and uncompressed files, always invoke gzip with mtime = 0. > > > > I agree. > > > I would caution though that relying on the checksum or similar of the > compressed data still > does not seem a very robust check of the data itself - the compressed file > would still definitely > change with any change in compression level, and quite possibly with > changes in the > linked compression library (perhaps even a simple version update). > Sure, but in practice, that variability happens more rarely. In particular, I'm not going to see that variability when I'm re-running the same workflow with the same software versions inside of a system like DVC[1] that uses file hashes to determine which parts of the workflow need to get rerun. Whereas I will see timestamp changes every time, thus invalidating the cache when it's irrelevant. [1] https://dvc.org/ > Shouldn’t you better verify the data buffers themselves? Outside of > Python, the lzma utility > xzcmp for example allows binary comparison of the content of compressed > files independently > from the timestamp or even compression method, and it handles gzip and > bzip2 files as well. > We're often integrating with systems that only know how to compute hashes of files and don't try to dig into the semantics of the files any deeper. We're not looking to construct the most provably-correct cache, just make some pragmatic choices that will reduce false cache misses with systems that currently exist and enable *better*, if still imperfect, uses of those systems. > I would like to follow up with a pull request, but I am unable to > > find out how numpy.savetxt is invoking gzip. 
> > > > `savetxt` uses the abstractions in this module to invoke gzip when the > filename calls for it: > > > > https://github.com/numpy/numpy/blob/main/numpy/lib/_datasource.py#L115 > > > One obstacle would be that this is setting gzip.open as _file_opener, > which does not have the > mtime option; getting this replaced with gzip.GzipFile to work in text > mode would require > somehow replicating gzip.open’s io.TextIOWrapper wrapping of GzipFile. > That should be easily doable. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] savetxt -> gzip: nondeterministic because of time stamp
On Wed, Apr 14, 2021 at 4:37 PM Joachim Wuttke wrote: > Regarding numpy, I'd propose a bolder measure: > To let savetxt(fname, X, ...) store exactly the same information in > compressed and uncompressed files, always invoke gzip with mtime = 0. > I agree. > I would like to follow up with a pull request, but I am unable to > find out how numpy.savetxt is invoking gzip. > `savetxt` uses the abstractions in this module to invoke gzip when the filename calls for it: https://github.com/numpy/numpy/blob/main/numpy/lib/_datasource.py#L115 -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] How to get Boolean matrix for similar lists in two different-size numpy arrays of lists
On Sun, Mar 14, 2021 at 3:06 PM Ali Sheikholeslam < sheikholeslam@gmail.com> wrote: > I have written a question in: > > https://stackoverflow.com/questions/66623145/how-to-get-boolean-matrix-for-similar-lists-in-two-different-size-numpy-arrays-o > It was recommended by numpy to send this subject to the mailing lists. > > The question is as follows. I would appreciate it if you could advise me > on how to solve the problem: > > At first, I write a small example of two lists: > > F = [[1,2,3],[3,2,7],[4,4,1],[5,6,3],[1,3,7]] # (1*5) 5 lists > S = [[1,3,7],[6,8,1],[3,2,7]] # (1*3) 3 lists > > I want to get a Boolean matrix for the lists shared between F and S: > > [False, True, False, False, True] # (1*5) 5 Booleans > for 5 lists of F > > By using IM = reduce(np.in1d, (F, S)) it gives results for each number in > each list of F: > > [ True True True True True True False False True False True True > True True True] # (1*15) > > By using IM = reduce(np.isin, (F, S)) it gives results for each number in > each list of F, too, but in another shape: > > [[ True True True] > [ True True True] > [False False True] > [False True True] > [ True True True]] # (5*3) > > The true result will be achieved by the code IM = [i in S for i in F] for the > example lists, but when I'm using this code for my two main bigger numpy > arrays of lists: > > > https://drive.google.com/file/d/1YUUdqxRu__9-fhE1542xqei-rjB3HOxX/view?usp=sharing > > numpy array: 3036 lists > > > https://drive.google.com/file/d/1FrggAa-JoxxoRqRs8NVV_F69DdVdiq_m/view?usp=sharing > > numpy array: 300 lists > > It gives the wrong answer. For the main files it must give 3036 Booleans, of > which only 300 are 'True'. I don't understand why this gets wrong > answers? It seems it applied only to the 3rd element of each list of > F. It is preferred to use the reduce function with the two functions, np.in1d and > np.isin, instead of the last method. How could each of the three > above methods be fixed?? 
> Thank you for providing the data. Can you show a complete, runnable code sample that fails? There are several things that could go wrong here, and we can't be sure which is which without the exact code that you ran. In general, you may well have problems with the floating point data that you are not seeing with your integer examples. FWIW, I would continue to use something like the `IM = [i in S for i in F]` list comprehension for data of this size. You aren't getting any benefit trying to convert to arrays and using our array set operations. They are written for 1D arrays of numbers, not 2D arrays (attempting to treat them as 1D arrays of lists) and won't really work on your data. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
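With the toy lists from the question, the list comprehension gives the expected mask. For larger 2-D arrays, one straightforward alternative (a sketch, not a numpy routine) is to hash whole rows as tuples in a set, which stays row-wise and avoids the 1-D set operations entirely:

```python
import numpy as np

F = [[1, 2, 3], [3, 2, 7], [4, 4, 1], [5, 6, 3], [1, 3, 7]]
S = [[1, 3, 7], [6, 8, 1], [3, 2, 7]]

# the row-wise membership test from the thread:
IM = [i in S for i in F]
print(IM)    # -> [False, True, False, False, True]

# for large 2-D arrays, compare whole rows via a set of tuples:
S_set = {tuple(row) for row in np.asarray(S).tolist()}
IM2 = [tuple(row) in S_set for row in np.asarray(F).tolist()]
print(IM2 == IM)    # -> True
```

Note that with floating-point data, exact row equality is required for a row to count as a match, whichever method is used.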
Re: [Numpy-discussion] size of arrays
On Sat, Mar 13, 2021 at 4:18 PM wrote: > So is it right that 100 arrays of one element is smaller than one array > with size of 100 elements? > No, typically the opposite is true. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] size of arrays
On Sat, Mar 13, 2021 at 4:02 PM wrote: > Dear colleagues! > > Size of np.float16(1) is 26 > Size of np.float64(1) is 32 > 32 / 26 = 1.23 > Note that `sys.getsizeof()` is returning the size of the given Python object in bytes. `np.float16(1)` and `np.float64(1)` are so-called "numpy scalar objects" that wrap up the raw `float16` (2 bytes) and `float64` (8 bytes) values with the necessary information to make them Python objects. The extra 24 bytes for each is _not_ present for each value when you have `float16` and `float64` arrays of larger lengths. There is still some overhead to make the array of numbers into a Python object, but this does not increase with the number of array elements. This is what you are seeing below when you compute the sizes of the Python objects that are the arrays. The fixed overhead does not increase when you increase the sizes of the arrays. They eventually approach the ideal ratio of 4: `float64` values take up 4 times as many bytes as `float16` values, as the names suggest. The ratio of 1.23 that you get from comparing the scalar objects reflects that the overhead for making a single value into a Python object takes up significantly more memory than the actual single number itself. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
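The distinction is easy to check: per-element storage is reported by `itemsize` and `nbytes`, while `sys.getsizeof` includes the fixed Python-object overhead that dominates for scalars.

```python
import sys
import numpy as np

# raw storage per value: 2 bytes vs 8 bytes
print(np.float16(1).itemsize, np.float64(1).itemsize)     # 2 8

# getsizeof adds the Python-object overhead (exact numbers are
# platform-dependent, roughly the 26 and 32 quoted above)
print(sys.getsizeof(np.float16(1)), sys.getsizeof(np.float64(1)))

# for large arrays the overhead is fixed, so the ratio approaches 4
a16 = np.zeros(1_000_000, dtype=np.float16)
a64 = np.zeros(1_000_000, dtype=np.float64)
print(a64.nbytes / a16.nbytes)                            # -> 4.0
```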
Re: [Numpy-discussion] Using logfactorial instead of loggamma in random_poisson sampler
On Sat, Mar 6, 2021 at 1:45 PM Warren Weckesser wrote: > At the time, making that change was not a high priority, so I didn't > pursue it. It does make sense to use the logfactorial function there, > and I'd be happy to see it updated, but be aware that making the > change is more work than changing just the function call. > Does it make a big difference? Per NEP 19, even in `Generator`, we do weigh the cost of changing the stream reasonably highly. Improved accuracy is likely worthwhile, but a minor performance improvement is probably not. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 3:42 PM Ralf Gommers wrote: > > On Fri, Feb 12, 2021 at 9:21 PM Robert Kern wrote: > >> On Fri, Feb 12, 2021 at 1:47 PM Ralf Gommers >> wrote: >> >>> >>> On Fri, Feb 12, 2021 at 7:25 PM Sebastian Berg < >>> sebast...@sipsolutions.net> wrote: >>> >>>> >>>> Right, my initial feeling it that without such context `atleast_3d` is >>>> pretty surprising. So I wonder if we can design `atleast_nd` in a way >>>> that it is explicit about this context. >>>> >>> >>> Agreed. I think such a use case is probably too specific to design a >>> single function for, at least in such a hardcoded way. >>> >> >> That might be an argument for not designing a new one (or at least not >> giving it such a name). Not sure it's a good argument for removing a >> long-standing one. >> > > I agree. I'm not sure deprecating is best. But introducing new > functionality where `nd(pos=3) != 3d` is also not great. > > At the very least, atleast_3d should be better documented. It also is > telling that Juan (a long-time scikit-image dev) doesn't like atleast_3d > and there's very little usage of it in scikit-image. > I'm fairly neutral on atleast_nd(). I think that for n=1 and n=2, you can derive The One Way to Do It from broadcasting semantics, but for n>=3, I'm not sure there's much value in trying to systematize it to a single convention. I think that once you get up to those dimensions, you start to want to have domain-specific semantics. I do agree that, in retrospect, atleast_3d() probably should have been named more specifically. It was of a piece with other conveniences like dstack() that did special things to support channel-last images (and implicitly treat 3D arrays as such). For example, for DL frameworks that assemble channeled images into minibatches (with different conventions like BHWC and BCHW), you'd want the n=4 behavior to do different things. I _think_ you'd just want to do those with different functions than a complicated set of arguments to one function. 
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 1:47 PM Ralf Gommers wrote: > > On Fri, Feb 12, 2021 at 7:25 PM Sebastian Berg > wrote: > >> On Fri, 2021-02-12 at 10:08 -0500, Robert Kern wrote: >> > On Fri, Feb 12, 2021 at 9:45 AM Joseph Fox-Rabinovitz < >> > jfoxrabinov...@gmail.com> wrote: >> > >> > > >> > > >> > > On Fri, Feb 12, 2021, 09:32 Robert Kern >> > > wrote: >> > > >> > > > On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser < >> > > > wieser.eric+nu...@gmail.com> >> > > > wrote: >> > > > >> > > > > > There might be some linear algebraic reason why those axis >> > > > > > positions >> > > > > make sense, but I’m not aware of it... >> > > > > >> > > > > My guess is that the historical motivation was to allow >> > > > > grayscale `(H, >> > > > > W)` images to be converted into `(H, W, 1)` images so that they >> > > > > can be >> > > > > broadcast against `(H, W, 3)` RGB images. >> > > > > >> > > > >> > > > Correct. If you do introduce atleast_nd(), I'm not sure why you'd >> > > > deprecate and remove the one existing function that *isn't* made >> > > > redundant >> > > > thereby. >> > > > >> > > >> > > `atleast_nd` handles the promotion of 2D to 3D correctly. The `pos` >> > > argument lets you tell it where to put the new axes. What's >> > > unintuitive to >> > > my is that the 1D case gets promoted to from shape `(x,)` to shape >> > > `(1, x, >> > > 1)`. It takes two calls to `atleast_nd` to replicate that behavior. >> > > >> > >> > When thinking about channeled images, the channel axis is not of the >> > same >> > kind as the H and W axes. Really, you tend to want to think about an >> > RGB >> > image as a (H, W) array of colors rather than an (H, W, 3) ndarray of >> > intensity values. As much as possible, you want to treat RGB images >> > similar >> > to (H, W)-shaped grayscale images. Let's say I want to make a >> > separable >> > filter to convolve with my image, that is, we have a 1D filter for >> > each of >> > the H and W axes, and they are repeated for each channel, if RGB. 
>> > Setting >> > up a separable filter for (H, W) grayscale is straightforward with >> > broadcasting semantics. I can use (ntaps,)-shaped vector for the W >> > axis and >> > (ntaps, 1)-shaped filter for the H axis. Now, when I go to the RGB >> > case, I >> > want the same thing. atleast_3d() adapts those correctly for the (H, >> > W, >> > nchannels) case. >> >> Right, my initial feeling it that without such context `atleast_3d` is >> pretty surprising. So I wonder if we can design `atleast_nd` in a way >> that it is explicit about this context. >> > > Agreed. I think such a use case is probably too specific to design a > single function for, at least in such a hardcoded way. > That might be an argument for not designing a new one (or at least not giving it such a name). Not sure it's a good argument for removing a long-standing one. Broadcasting is a very powerful convention that makes coding with arrays tolerable. It makes some choices (namely, prepending 1s to the shape) to make some common operations with mixed-dimension arrays work "by default". But it doesn't cover all of the desired operations conveniently. atleast_3d() bridges the gap to an important convention for a major use-case of arrays. There's also "channels first" and "channels last" versions of RGB images as > 3-D arrays, and "channels first" is the default in most deep learning > frameworks - so the choice atleast_3d makes is a little outdated by now. > DL frameworks do not constitute the majority of image processing code, which has a very strong channels-last contingent. But nonetheless, the very popular Tensorflow defaults to channels-last. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 9:45 AM Joseph Fox-Rabinovitz < jfoxrabinov...@gmail.com> wrote: > > > On Fri, Feb 12, 2021, 09:32 Robert Kern wrote: > >> On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser >> wrote: >> >>> > There might be some linear algebraic reason why those axis positions >>> make sense, but I’m not aware of it... >>> >>> My guess is that the historical motivation was to allow grayscale `(H, >>> W)` images to be converted into `(H, W, 1)` images so that they can be >>> broadcast against `(H, W, 3)` RGB images. >>> >> >> Correct. If you do introduce atleast_nd(), I'm not sure why you'd >> deprecate and remove the one existing function that *isn't* made redundant >> thereby. >> > > `atleast_nd` handles the promotion of 2D to 3D correctly. The `pos` > argument lets you tell it where to put the new axes. What's unintuitive to > me is that the 1D case gets promoted from shape `(x,)` to shape `(1, x, > 1)`. It takes two calls to `atleast_nd` to replicate that behavior. > When thinking about channeled images, the channel axis is not of the same kind as the H and W axes. Really, you tend to want to think about an RGB image as a (H, W) array of colors rather than an (H, W, 3) ndarray of intensity values. As much as possible, you want to treat RGB images similar to (H, W)-shaped grayscale images. Let's say I want to make a separable filter to convolve with my image, that is, we have a 1D filter for each of the H and W axes, and they are repeated for each channel, if RGB. Setting up a separable filter for (H, W) grayscale is straightforward with broadcasting semantics. I can use (ntaps,)-shaped vector for the W axis and (ntaps, 1)-shaped filter for the H axis. Now, when I go to the RGB case, I want the same thing. atleast_3d() adapts those correctly for the (H, W, nchannels) case. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
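Concretely, the shapes `atleast_3d` produces are the separable-filter shapes for an (H, W, nchannels) image: the trailing length-1 axis lines up against the channel axis, so in an N-D convolution (e.g. with `scipy.ndimage.convolve`, to pick one assumed tool) each channel is filtered independently.

```python
import numpy as np

# grayscale separable-filter pieces:
w_taps = np.ones(5) / 5            # (5,)  : 1-D filter along W
h_taps = w_taps[:, np.newaxis]     # (5, 1): same filter, oriented along H

# atleast_3d adapts both for an (H, W, nchannels) image, adding
# length-1 axes so channels (and the other image axis) stay unmixed:
print(np.atleast_3d(w_taps).shape)   # -> (1, 5, 1)
print(np.atleast_3d(h_taps).shape)   # -> (5, 1, 1)
```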
Re: [Numpy-discussion] ENH: Proposal to add atleast_nd function
On Fri, Feb 12, 2021 at 5:15 AM Eric Wieser wrote: > > There might be some linear algebraic reason why those axis positions > make sense, but I’m not aware of it... > > My guess is that the historical motivation was to allow grayscale `(H, W)` > images to be converted into `(H, W, 1)` images so that they can be > broadcast against `(H, W, 3)` RGB images. > Correct. If you do introduce atleast_nd(), I'm not sure why you'd deprecate and remove the one existing function that *isn't* made redundant thereby. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 12:10 PM Sebastian Berg wrote: > > This type of change should be in the release notes undoubtedly and > likely a `.. versionchanged::` directive in the docstring. > > Maybe the best thing would be to create a single, prominent but brief, > changelog listing all (or almost all) stream changes? (Possibly instead > of documenting it in the individual function as `.. versionchanged::`) > > I am thinking just a table with: > * version changed > * short description > * affected functions > * how the stream changed (if that is ever relevant) > Both are probably useful. Good ideas. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 11:38 AM Kevin Sheppard wrote: > That is good news indeed. Seems like a good upgrade, especially given the > breadth of application of normals and the multiple appearances within > distributions.c (e.g., Cauchy). Is there a deprecation for a change like > this? Or is it just a note and new random numbers in the next major? I > think this is the first time a substantially new algo has replaced an > existing one. > Per NEP 19, a change like this is a new feature that can be included in a feature release, like any other feature. I would like to see some more testing of the quality of the sequences beyond the tests already quoted, using Kolmogorov-Smirnov and/or Anderson-Darling tests as well, which should be more thorough. https://github.com/scipy/scipy/blob/master/scipy/stats/tests/test_continuous_basic.py#L604-L620 There are also some subtle issues involved in ziggurat method implementations that go beyond just the marginal distributions. I'm not sure, even, that our current implementation deals with the issues raised in the following paper, but I'd like to do no worse. https://www.doornik.com/research/ziggurat.pdf -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 10:53 AM Kevin Sheppard wrote: > My reading is that the first 4 are pure C, presumably using the standard > practice of inlining so as to make the tightest loop possible, and to > allow the compiler to make other optimizations. The final line is what > happens when you replace the existing ziggurat in NumPy with the new one. I > read it this way since it has both “new” and “old” with numpy. If it isn’t > this, then I’m unsure what “new” and “old” could mean in the context of > this thread. > No, these are our benchmarks of `Generator`. `numpy` is testing `RandomState`, which wasn't touched by their contribution. https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_random.py#L93-L97 https://github.com/numpy/numpy/blob/master/benchmarks/benchmarks/bench_random.py#L123-L124 > I suppose camel-cdr can clarify what was actually done. > But I did run the built-in benchmark: ./runtests.py --bench > bench_random.RNG.time_normal_zig and the results are: > > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Mon, Feb 8, 2021 at 3:05 AM Kevin Sheppard wrote: > If I understand correctly, there is no gain when applying this patch to > Generator. I'm not that surprised that this is the case since the compiler > is much more limited in what it can do in Generator than what it can when > filling a large array directly with functions available for inlining and > unrolling. Again, if I understand correctly I think it will be difficult to > justify breaking the stream for a negligible gain in performance. > Can you explain your understanding of the benchmark results? To me, it looks like nearly a 2x improvement with the faster BitGenerators (our default PCG64 and SFC64). That may or may not be worth breaking the stream, but it's far from negligible. > But I did run the built-in benchmark: ./runtests.py --bench >> bench_random.RNG.time_normal_zig and the results are: >>
>>              new          old
>> PCG64        589±3μs      1.06±0.03ms
>> MT19937      985±4μs      1.44±0.01ms
>> Philox       981±30μs     1.39±0.01ms
>> SFC64        508±4μs      900±4μs
>> numpy        2.99±0.06ms  2.98±0.01ms  # no change for /dev/urandom
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Question about optimizing random_standard_normal
On Sat, Feb 6, 2021 at 7:27 AM wrote: > Well, I can tell you why it needs to be backward compatible! I use random > numbers fairly frequently, and to unit test them I set a specific seed and > then make sure I get the same answers. > > Hmm, I guess that makes sense. I tried to adjust my algorithms to do the > same thing with the same bit's as the original one, but I couldn't get it > to work. > To be clear, this is not our backwards compatibility policy for the methods that you have modified. Our policy is spelled out here: https://numpy.org/neps/nep-0019-rng-policy.html The TestRandomDist suite of tests were adapted from the older RandomState (which is indeed frozen and not allowed to change algorithms). It's a mix of correctness tests that are valid regardless of the precise algorithm (does this method reject invalid arguments? do degenerate arguments yield the correct constant value?) and actual "has this algorithm changed unexpectedly?" tests. The former are the most valuable, but the latter are useful for testing in cross-platform contexts. Compilers and different CPUs can do naughty things sometimes, and we want the cross-platform differences to be minimal. When you do change an algorithm implementation for Generator, as you have done, you are expected to do thorough tests (offline, not in the unit tests) that it is correctly sampling from the target probability distribution, then once satisfied, change the hard-coded values in TestRandomDist to match whatever you are generating. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
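A minimal sketch of the kind of fixed-seed check involved; the expected values here are compared between two generators rather than hard-coded, so the example stays valid across algorithm changes (a real TestRandomDist entry would instead compare against values generated offline).

```python
import numpy as np

def test_stream_reproducible():
    # the property every fixed-seed regression test relies on:
    # the same seed always yields the identical stream
    a = np.random.default_rng(12345).standard_normal(5)
    b = np.random.default_rng(12345).standard_normal(5)
    assert np.array_equal(a, b)

def test_degenerate_args():
    # algorithm-independent correctness: degenerate arguments
    # must yield the correct constant value
    rng = np.random.default_rng(0)
    assert (rng.normal(loc=3.0, scale=0.0, size=4) == 3.0).all()

test_stream_reproducible()
test_degenerate_args()
```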
Re: [Numpy-discussion] NumPy-Discussion Digest, Vol 171, Issue 39
On Tue, Dec 29, 2020 at 5:35 PM Brian Soto wrote: > I'm still learning proper mailing list etiquette so I'm not sure if this > is where I should respond. > FWIW, if you want to reply to conversations regularly, it's more ergonomic all around to subscribe regularly and not use the digests. But if you only occasionally want to jump in, using the digest is fine. It helps everyone if you change the subject line back to that of the thread that you are replying to and trim the digest to just the email you are responding to. Thank you! I hope this is helpful. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Addition of new distributions: Polya-gamma
My view is that we will not add more non-uniform distribution (i.e. "named" statistical probability distributions like Polya-Gamma) methods to `Generator`. I think that we might add a couple more methods to handle some more fundamental issues (like sampling from the unit interval with control over whether each boundary is open or closed, maybe one more variation on shuffling) that help write randomized algorithms. Now that we have the C and Cython APIs which allow one to implement non-uniform distributions in other packages, we strongly encourage that. As I commented on the linked PR, `scipy.stats` would be a reasonable place for a Polya-Gamma sampling function, even if it's not feasible to implement an `rv_continuous` class for it. You have convinced me that the nature of the Polya-Gamma distribution warrants this. The only issue is that scipy still depends on a pre-`Generator` version of numpy. So I recommend implementing this function in your own package with an eye towards contributing it to scipy later. On Sun, Dec 27, 2020 at 6:05 AM Zolisa Bleki wrote: > Hi All, > > I would like to know if Numpy accepts addition of new distributions since > the implementation of the Generator interface. If so, what are the criteria > for a particular distribution to be accepted? The reason why I'm asking is > because I would like to propose adding the Polya-gamma distribution to > numpy, for the following reasons: > > 1) Polya-gamma random variables are commonly used as auxiliary variables > during data augmentation in Bayesian sampling algorithms, which have > widespread usage in Statistics and recently, Machine learning. > 2) Since this distribution is mostly useful for random sampling, it seems > appropriate to have it in numpy and not projects like scipy [1]. > 3) The only python/C++ implementation of the sampler available is licensed > under GPLv3 which I believe limits copying into packages that choose to use > a different license [2]. 
> 4) Numpy's random API makes adding the distribution painless. > > I have done preliminary work on this by implementing the distribution > sampler as described in [3]; see: > https://github.com/numpy/numpy/compare/master...zoj613:polyagamma . > There is a more efficient sampling algorithm described in a later paper > [4], but I chose not to start with that one unless I know it is worth > investing time in. > > I would appreciate your thoughts on this proposal. > > Regards, > Zolisa > > > Refs: > [1] https://github.com/scipy/scipy/issues/11009 > [2] https://github.com/slinderman/pypolyagamma > [3] https://arxiv.org/pdf/1205.0310v1.pdf > [4] https://arxiv.org/pdf/1405.0506.pdf > -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
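For context on what a third-party Polya-Gamma package would provide: a naive sampler can be prototyped in pure Python from the truncated infinite-sum representation in Polson, Scott & Windle (ref [3] above). This is far slower and less accurate in the tails than the algorithms in the linked references; the function name and truncation level are illustrative, not from any existing package:

```python
import numpy as np

def polya_gamma_naive(rng, b=1.0, z=0.0, trunc=200):
    """Naive PG(b, z) draw via a truncated sum of Gamma variates:

        PG(b, z) ~ (1 / (2*pi**2)) * sum_k g_k / ((k - 1/2)**2 + z**2 / (4*pi**2)),

    with g_k ~ Gamma(b, 1). Only the first `trunc` terms are kept, so
    this is a rough sketch, not a production-quality sampler."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return np.sum(g / ((k - 0.5) ** 2 + z ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

rng = np.random.default_rng(42)
draws = [polya_gamma_naive(rng) for _ in range(20_000)]
print(np.mean(draws))  # E[PG(1, 0)] = 1/4
```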
Re: [Numpy-discussion] locking np.random.Generator in a cython nogil context?
On Thu, Dec 17, 2020 at 9:56 AM Evgeni Burovski wrote: > On Thu, Dec 17, 2020 at 1:01 PM Matti Picus wrote: > > > > > > On 12/17/20 11:47 AM, Evgeni Burovski wrote: > > > Just as a side note, this is not very prominent in the docs, and I'm > > > ready to volunteer to send a doc PR --- I'm only not sure which part > > > of the docs, and would appreciate a pointer. > > > > Maybe here > > > > > https://numpy.org/devdocs/reference/random/bit_generators/index.html#seeding-and-entropy > > > > which is here in the sources > > > > > https://github.com/numpy/numpy/blob/master/doc/source/reference/random/bit_generators/index.rst#seeding-and-entropy > > > > > > And/or in the SeedSequence docstring documentation > > > > > https://numpy.org/devdocs/reference/random/bit_generators/generated/numpy.random.SeedSequence.html#numpy.random.SeedSequence > > > > which is here in the sources > > > > > https://github.com/numpy/numpy/blob/master/numpy/random/bit_generator.pyx#L255 > > > Here's the PR, https://github.com/numpy/numpy/pull/18014 > > Two minor comments, both OT for the PR: > > 1. The recommendation to seed the generators from the OS --- I've been > bitten by exactly this once. That was a rather exotic combination of a > vendor RNG and a batch queueing system, and some of my runs did end up > with identical random streams. Given that the recommendation is what > it is, it probably means that experience is a singular point and it no > longer happens with modern generators. > I suspect the vendor RNG was rolling its own entropy using time. We use `secrets.getrandbits()`, which ultimately uses the best cryptographic entropy source available. And if there is no cryptographic entropy source available, I think we fail hard instead of falling back to less reliable things like time. I'm not entirely sure that's a feature, but it is safe! > 2. 
Robert's comment that `SeedSequence(..., spawn_key=(num,))` is not > equivalent to `SeedSequence(...).spawn(num)[num]` and that the former > is not recommended. I'm not questioning the recommendation, but then > __repr__ seems to suggest the equivalence: > I was saying that they were equivalent. That's precisely why it's not recommended: it's too easy to do both and get identical streams inadvertently. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
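The equivalence Robert describes can be checked directly, and it is exactly why hand-built `spawn_key`s are discouraged: mixing them with later `spawn()` calls can silently produce identical streams. A small sketch:

```python
from numpy.random import SeedSequence

root = SeedSequence(2020)
child_via_spawn = root.spawn(1)[0]
child_by_hand = SeedSequence(2020, spawn_key=(0,))

# Same internal state -> identical streams, which is the hazard.
assert (child_via_spawn.generate_state(4) == child_by_hand.generate_state(4)).all()
```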
Re: [Numpy-discussion] locking np.random.Generator in a cython nogil context?
On Mon, Dec 14, 2020 at 3:27 PM Evgeni Burovski wrote: > > > > I also think that the lock only matters for Multithreaded code not > Multiprocess. I believe the latter pickles and unpickles any Generator > object (and the underying BitGenerator) and so each process has its own > version. Note that when multiprocessing the recommended procedure is to > use spawn() to generate a sequence of BitGenerators and to use a distinct > BitGenerator in each process. If you do this then you are free from the > lock. > > Thanks. Just to confirm: does using SeedSequence spawn_key arg > generate distinct BitGenerators? As in > > cdef class Wrapper(): > def __init__(self, seed): > entropy, num = seed > py_gen = PCG64(SeedSequence(entropy, spawn_key=(spawn_key,))) > self.rng = > py_gen.capsule.PyCapsule_GetPointer(capsule, "BitGenerator")# <--- > this > > cdef Wrapper rng_0 = Wrapper(seed=(123, 0)) > cdef Wrapper rng_1 = Wrapper(seed=(123, 1)) > > And then,of these two objects, do they have distinct BitGenerators? > The code you wrote doesn't work (`spawn_key` is never assigned). I can guess what you meant to write, though, and yes, you would get distinct `BitGenerator`s. However, I do not recommend using `spawn_key` explicitly. The `SeedSequence.spawn()` method internally keeps track of how many children it has spawned and uses that to construct the `spawn_key`s for its subsequent children. If you play around with making your own `spawn_key`s, then the parent `SeedSequence(entropy)` might spawn identical `SeedSequence`s to the ones you constructed. If you don't want to use the `spawn()` API to construct the separate `SeedSequence`s but still want to incorporate some per-process information into the seeds (e.g. the 0 and 1 in your example), then note that a tuple of integers is a valid value for the `entropy` argument. You can have the first item be the same (i.e. per-run information) and the second item be a per-process ID or counter. 
cdef class Wrapper():
    def __init__(self, seed):
        py_gen = PCG64(SeedSequence(seed))
        self.rng = py_gen.capsule.PyCapsule_GetPointer(capsule, "BitGenerator")

cdef Wrapper rng_0 = Wrapper(seed=(123, 0))
cdef Wrapper rng_1 = Wrapper(seed=(123, 1))

-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
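Robert's suggestion — shared per-run entropy plus a per-process integer in the `entropy` tuple — looks like this in pure Python (the process IDs 0 and 1 are illustrative, matching the example above):

```python
import numpy as np
from numpy.random import Generator, PCG64, SeedSequence

run_entropy = 123  # per-run value shared by all processes

# Each process mixes its own ID into the entropy tuple.
streams = [Generator(PCG64(SeedSequence((run_entropy, pid)))) for pid in (0, 1)]

# Distinct entropy tuples -> distinct, statistically independent streams.
a = streams[0].standard_normal(4)
b = streams[1].standard_normal(4)
assert not np.array_equal(a, b)
```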
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Fri, Dec 11, 2020 at 1:12 PM Aaron Meurer wrote: > On Fri, Dec 11, 2020 at 1:47 AM Eric Wieser > wrote: > > > > > you might want to discuss this with us at the array API standard > > > https://github.com/data-apis/array-api (which is currently in RFC > > > stage). The spec uses bool as the name for the boolean dtype. > > > > I don't fully understand this argument - `np.bool` is already not the > boolean dtype. Either: > > The spec does deviate from what NumPy currently does in some places. > If we wanted to just copy NumPy exactly, there wouldn't be a need for > a specification. I wouldn't take that as a premise. Specifying a subset of the vast existing NumPy API would be a quite valuable specification in its own right. I find the motivation for deviation laid out in the Purpose and Scope <https://data-apis.github.io/array-api/latest/purpose_and_scope.html#introduction> section to be reasonably convincing that deviation might be needed *somewhere*. The question then is, is *this* deviation supporting that stated motivation, or is it taking the opportunity of a redesign to rationalize the names more to our current tastes? Given the mode of adopting the standard (a separate subpackage), that's a reasonable choice to make, but let's be clear about the motivation. I submit that keeping the name `bool_` does not make it any harder for other array APIs to adopt the standard. It's just that few people would design a new API with that name if they were designing a greenfield API. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Wed, Dec 9, 2020 at 4:08 PM Aaron Meurer wrote: > On Wed, Dec 9, 2020 at 9:41 AM Sebastian Berg > wrote: > > > > On Mon, 2020-12-07 at 14:18 -0700, Aaron Meurer wrote: > > > Regarding np.bool specifically, if you want to deprecate this, you > > > might want to discuss this with us at the array API standard > > > https://github.com/data-apis/array-api (which is currently in RFC > > > stage). The spec uses bool as the name for the boolean dtype. > > > > > > Would it make sense for NumPy to change np.bool to just be the > > > boolean > > > dtype object? Unlike int and float, there is no ambiguity with bool, > > > and NumPy clearly doesn't have any issues with shadowing builtin > > > names > > > in its namespace. > > > > We could keep the Python alias around (which for `dtype=` is the same > > as `np.bool_`). > > > > I am not sure I like the idea of immediately shadowing the builtin. > > That is a switch we can avoid flipping (without warning); `np.bool_` > > and `bool` are fairly different beasts? [1] > > NumPy already shadows a lot of builtins, in many cases, in ways that > are incompatible with existing ones. It's not something I would have > done personally, but it's been this way for a long time. > Sometimes, we had the functions first, before Python added them to the builtins (e.g. sum(), any(), all(), IIRC). I think max() and min() are the main ones that we added after Python did, and we explicitly exclude them from __all__ to avoid clobbering the builtins. Shadowing the types (bool, int, float) historically tended to be more problematic than those functions. The first releases of numpy _did_ have those as the scalar types. That empirically turned out to cause more problems for people than sum() or any(), so we renamed the scalar types to have the trailing underscore. We only left the shadowed names as aliases for the builtins because enough people still had `dtype=np.float` in their code that we didn't want to break. 
All that said, "from numpy import *" is less common these days. We have been pretty successful at getting people on board with the np campaign. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
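The renamed scalar types Robert mentions are still there with their trailing underscores, distinct from the builtins they once shadowed — a quick check:

```python
import numpy as np

# The numpy scalar types got trailing underscores so that
# "from numpy import *" stops clobbering the builtins.
assert np.bool_ is not bool
assert np.int_ is not int
assert np.float64 is not float
# ...though float64 still subclasses the Python float for interop.
assert isinstance(np.float64(1.0), float)
```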
Re: [Numpy-discussion] np.{bool,float,int} deprecation
On Sun, Dec 6, 2020 at 12:52 AM Stephan Hoyer wrote: > On Sat, Dec 5, 2020 at 9:24 PM Mark Harfouche > wrote: > >> If the answer is to deprecate >> >> np.int(1) == int(1) >> >> then one can add a warning to the __init__ of the np.int class, but >> continue to subclass the python int class. >> >> It just doesn’t seem worthwhile to to stop people from using dtype=np.int, >> which seem to read: >> >> “I want this to be a numpy integer, not necessarily a python integer”. >> > The problem is that there is assuredly code that inadvertently relies upon > this (mis)feature. > > If we change the behavior of np.int() to create np.int64() objects > instead of int() objects, it is likely to result in breaking some user > code. Even with a prior warning, this breakage may be surprising and very > hard to track down. In contrast, it's much safer to simply remove np.int > entirely, because if users ignore the deprecation they end up with an error. > FWIW (and IIRC), *this* was the original misfeature. `np.int`, `np.bool`, and `np.float` were aliases for their corresponding default scalar types in the first numpy releases. However, too many people were doing `from numpy import *` and covering up the builtins. We renamed these aliases with trailing underscores to avoid that problem, but too many people (even in those early days) still had uses of `dtype=np.int`. Making `np.int is int` was the backwards-compatibility hack. -- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] New package to speed up ufunc inner loops
On Wed, Nov 4, 2020 at 7:22 PM Aaron Meurer wrote: > > > That's not to say that there isn't clearer language that could be > drafted. The NEP is still in Draft stage. But if you think it could be > clearer, please propose specific edits to the draft. Like with unclear > documentation, it's the person who finds the current docs > insufficient/confusing/unclear that is in the best position to recommend > the language that would have helped them. Collaboration helps. > > I disagree. The best person to write documentation is the person who > actually understands the package. I already noted that I don't > actually understand the actual situation with the trademark, for > instance. > Rather, I meant that the best person to fix confusing language is the person who was confused, after consulting with the authors/experts to come to a consensus about what was intended. > I don't really understand why there is pushback for making NEP > clearer. Also "like with unclear documentation", if someone says that > documentation is unclear, you should take their word for it that it > actually is, and improve it, rather than somehow trying to argue that > they actually aren't confused. > I'm not. I'm saying that I don't know how to make it more clear to those people because I'm not experiencing it like they are. The things I could think to add are the same kinds of things that were already stated explicitly in the Abstract, Motivation, and Scope. It seems like Stefan is in the same boat. Authors need editors, but the editor can't just say "rewrite!" I don't know what kind of assumptions and context this hypothetical reader is bringing to this reading that are leading to confusion. Sometimes it's clear, but not for me, here (and more relevantly, Stefan). Do you think this needs a complete revamp? Or just an additional sentence to explicitly state that this does not add additional legal restrictions to the copyright license? 
-- Robert Kern ___ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion