Re: [Numpy-discussion] Fortran order in recarray.

2017-02-22 Thread Stephan Hoyer
On Wed, Feb 22, 2017 at 8:57 AM, Alex Rogozhnikov <
alex.rogozhni...@yandex.ru> wrote:

> Pandas may be nice if you need a report and you need to get it done
> tomorrow; then you'll throw away the code. When we initially used pandas as
> the main data storage in yandex/rep, it looked like a good idea, but a year
> later it was obvious this was the wrong decision. When you build a data
> pipeline / research that should still be working several years later (on
> some other installation, by someone else), usage of pandas should be
> *minimal*.
>

The pandas development team (myself included) is well aware of these
issues. There are long term plans/hopes to fix this, but there's a lot of
work to be done and some hard choices to make:
https://github.com/pandas-dev/pandas/issues/1
https://github.com/pandas-dev/pandas/issues/13862

> That's why I am looking for a reliable pandas substitute, which should be:
>
> - completely consistent with numpy, and failing when something isn't
> implemented / is impossible
> - fewer new abstractions; nobody wants to learn
> one-more-way-to-manipulate-the-data, specifically other researchers
> - it may be less convenient for interactive data munging
>   - in particular, fewer methods is ok
> - written code should be interpretable, and hard to misinterpret
> - not super slow; 1-10 gigabyte datasets are a normal situation
>

This has some overlap with our motivations for writing Xarray (
http://xarray.pydata.org), so I encourage you to take a look. It still
might be more complex than you're looking for, but we did try to clean up
the really ambiguous APIs from pandas like indexing.


Re: [Numpy-discussion] __numpy_ufunc__

2017-02-22 Thread Stephan Hoyer
On Wed, Feb 22, 2017 at 6:31 AM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> It seems to me entirely logical (but then it would, I suggested it
> before...) that we allow opting out by setting `__array_ufunc__` to
> None; in that case, binops return NotImplemented and ufuncs raise
> errors. (In addition, or alternatively, one could allow setting
> `__array__` to None, which would generally prevent something from being
> turned into an array object.)
>

This is indeed appealing, but I recall this was still a point of contention
because it leaves intact two different ways to override arithmetic
involving numpy arrays. Mimicking all this logic on classes designed to
wrap well-behaved array-like classes (e.g., xarray, which can wrap NumPy or
Dask arrays) could be painful -- it's easier to just call np.add and let it
handle all the dispatching rather than also worrying about NotImplemented.
That said, I think the opt-out is probably OK, as long as we make it clear
that defining __array_ufunc__ to do arithmetic overriding is the preferred
solution (and provide appropriate Mixin classes to make this easier).

Just to be clear: if __array__ = None but __array_ufunc__ is defined, this
would be a class that defines array-like operations but can't be directly
converted into a NumPy array? For example, a scipy.sparse matrix?
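
For concreteness, a minimal sketch of the opt-out (my illustration, written
against the semantics NumPy ultimately shipped for __array_ufunc__ in 1.13):

import numpy as np

class NoUfuncs:
    # Opting out: ufuncs applied to this class raise TypeError, and
    # ndarray binops return NotImplemented, so Python falls back to
    # our reflected methods.
    __array_ufunc__ = None

    def __radd__(self, other):
        return "NoUfuncs wins"

print(np.arange(3) + NoUfuncs())    # NoUfuncs wins
# np.add(np.arange(3), NoUfuncs())  # would raise TypeError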


Re: [Numpy-discussion] Proposal to support __format__

2017-02-14 Thread Stephan Hoyer
On Tue, Feb 14, 2017 at 5:35 PM, Gustav Larsson 
wrote:

>> 1. For object arrays, I would default to calling format on each element
>> (your "map principle") rather than raising an error.
>>
>
> I'm glad you brought this up as a possibility. It might be possible, but
> there are some issues that would need to be resolved. First of all, {} and
> {:} always works and gives the same result it currently does. So, this only
> affects the situation where the format spec is non-empty. I think there are
> two main issues:
>
> Heterogeneity: Let's say we have x = np.array([12.3, True, 'string',
> Foo(10)], dtype=np.object). Then, presumably {:.1f} should cause a
> ValueError since the string does not support format type 'f'. This could
> create a lot of ValueError land mines for the user.
>

Things will absolutely break if you try to do complex operations on
heterogeneously typed object arrays. I would put the onus on the user in
such a case.
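
To make the land mine concrete with plain Python format() (my illustration):

import numpy as np

x = np.array([12.3, True, 'string'], dtype=object)
[format(v, '.1f') for v in x[:2]]  # ['12.3', '1.0']: float and bool comply
# format(x[2], '.1f')  # ValueError: Unknown format code 'f' for str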


> For x[:2] however it should work and produce something like [12.3  1.0].
> Note, the "map principle" still can't be strictly true. Let's say we have
> an array with type object and mostly string-like elements. Then {:5s} will
> still not produce exactly {:5s} element-wise, because the string
> representations need to be repr-based inside the array (otherwise it could
> break for newlines and things like that and produce spaces that make the
> boundary between elements ambiguous). This brings me to the next issue.
>

Indeed, this will be a departure from the behavior without a format string,
which just uses repr. In my mind, this is the strongest argument against
using the map principle here, because there is a discontinuous shift
between providing and not providing a format string.


> Str vs. repr: If we have a homogeneous object-array with types Foo and Foo
> implements __format__, it would be great if this worked. However, one issue
> is that Foo.__format__ might return things like newline (or spaces), which
> would break (or confuse) the printed output (unless it is made incredibly
> smart to support "vertical alignment"). This issue is essentially the same
> as for strings in general, which is why they use repr instead. I can think
> of two solutions: 1) Try to sanitize (or repr-ify) the string returned by
> __format__ somehow; 2) Put the responsibility on the user and simply let
> the rendering break if Foo.__format__ does not play well.
>

I wouldn't do anything fancy here to worry about line breaks. It's
basically impossible to get this right for edge cases, so I would certainly
put the responsibility on the user.

On another note, about Python 2 vs 3: I would definitely take the approach
of copying the Python 3 behavior on all versions of NumPy (when feasible)
and not being too concerned about compatibility with format on Python 2.
The future is Python 3.


Re: [Numpy-discussion] Proposal to support __format__

2017-02-14 Thread Stephan Hoyer
On Tue, Feb 14, 2017 at 3:34 PM, Gustav Larsson 
wrote:

> Hi everyone!
>
> I want to discuss adding support for __format__ in ndarray and I am
> willing to contribute code-wise once consensus has been reached. It was
> briefly discussed on GitHub two years ago
> (https://github.com/numpy/numpy/issues/5543) and I will re-iterate some of
> the points made there and build off of that. I have been thinking about
> this a lot in the last few weeks and my thoughts turned into a fairly
> fleshed out proposal. The discussion should probably start more
> high-level, so I apologize if the level of detail is inappropriate at this
> point in time.
>
> I decided on a gist, since the email got too long and clear formatting
> helps:
>
> https://gist.github.com/gustavla/2783543be1204d2b5d368f6a1fb4d069


This is a lovely and clearly written document. Thanks for taking the time
to think through this!

I encourage you to submit it as a pull request to the NumPy repository as a
"NumPy Enhancement Proposal", either now or after we've discussed it:
https://docs.scipy.org/doc/numpy-dev/neps/index.html


> OK, those are my thoughts for now. What do you think?
>

Two thoughts for now:
1. For object arrays, I would default to calling format on each element
(your "map principle") rather than raising an error.
2. It's absolutely OK to leave functionality unimplemented and not
immediately nail down every edge case. As a default, I would suggest
raising errors whenever non-empty type specifications are provided rather
than raising errors in every case.
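
A minimal sketch of that conservative default (hypothetical helper; the name
and exact error are mine, not part of the proposal):

import numpy as np

def ndarray_format(arr, format_spec):
    # Empty spec keeps today's behavior; any non-empty spec errors
    # until its semantics are nailed down.
    if not format_spec:
        return str(arr)
    raise TypeError('format spec %r not yet supported for ndarray'
                    % format_spec)

ndarray_format(np.arange(3), '')  # '[0 1 2]'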


Re: [Numpy-discussion] ANN: xarray v0.9 released

2017-02-01 Thread Stephan Hoyer
On Wed, Feb 1, 2017 at 12:55 AM, Marmaduke Woodman 
wrote:

> Looks very nice; is the API stable or are you waiting for a v1.0 release?
>

We are pretty close to full API stability but not quite there yet. Enough
people are using xarray in production that breaking changes are made with
serious caution (and deprecation cycles whenever feasible).

The only major backwards-incompatible change planned is an overhaul of
indexing to use labeled broadcasting and alignment:
https://github.com/pydata/xarray/issues/974

There are a few other "nice to have" features for v1.0 but that's the only
one that has the potential to change functionality in a way that we can't
cleanly deprecate.


> Is there significant overhead compared to plain ndarray?


Xarray is implemented in Python (not C), so it does have significant
overhead for every operation. Adding two arrays takes ~100 us, rather than
<1 us in NumPy. So you don't want to use it in your inner loop.

That said, the overhead is independent of the size of the array. So if you
work with large arrays, it is negligible.
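
A rough way to see the constant overhead yourself (illustrative only; exact
numbers are machine-dependent):

import numpy as np
import xarray as xr

small = xr.DataArray(np.ones(10))
big = xr.DataArray(np.ones(10 ** 7))

# In IPython:
# %timeit small + small   # ~100 us, dominated by Python-level overhead
# %timeit big + big       # same constant overhead, now a rounding error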


[Numpy-discussion] ANN: xarray v0.9 released

2017-01-31 Thread Stephan Hoyer
I'm pleased to announce the release of the latest major version of xarray,
v0.9.

xarray is an open source project and Python package that provides a toolkit
and data structures for N-dimensional labeled arrays. Its approach combines
an API inspired by pandas with the Common Data Model for self-described
scientific data.

This release includes five months' worth of enhancements and bug fixes from
24 contributors, including some significant enhancements to the data model
that are not fully backwards compatible.

Highlights include:
- Coordinates are now optional in the xarray data model, even for
dimensions.
- Changes to caching, lazy loading and pickling to improve xarray’s
experience for parallel computing.
- Improvements for accessing and manipulating pandas.MultiIndex levels.
- Many new methods and functions, including quantile(), cumsum(),
cumprod(), combine_first(), set_index(), reset_index(), reorder_levels(),
full_like(), zeros_like(), ones_like(), open_dataarray(), compute(),
Dataset.info(), testing.assert_equal(), testing.assert_identical(), and
testing.assert_allclose().

For more details, read the full release notes:
http://xarray.pydata.org/en/latest/whats-new.html

You can install xarray with pip or conda:
pip install xarray
conda install -c conda-forge xarray

Best,
Stephan


Re: [Numpy-discussion] numpy vs algebra Was: Integers to negative integer powers...

2017-01-03 Thread Stephan Hoyer
On Tue, Jan 3, 2017 at 3:05 PM, Nathaniel Smith  wrote:

> It's possible we should back off to just issuing a deprecation warning in
> 1.12?
>
> On Jan 3, 2017 1:47 PM, "Yaroslav Halchenko"  wrote:
>
>> hm... testing on current master (first result is from python's pow)
>>
>> $> python -c "import numpy; print('numpy version: ', numpy.__version__);
>> a=2; b=-2;  print(pow(a,b)); print(pow(numpy.array(a), b))"
>> ('numpy version: ', '1.13.0.dev0+02e2ea8')
>> 0.25
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> ValueError: Integers to negative integer powers are not allowed.
>>
>>
>> testing on Debian's packaged beta
>>
>> $> python -c "import numpy; print('numpy version: ', numpy.__version__);
>> a=2; b=-2;  print(pow(a,b)); print(pow(numpy.array(a), b))"
>> ('numpy version: ', '1.12.0b1')
>> 0.25
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> ValueError: Integers to negative integer powers are not allowed.
>>
>>
>> testing on stable debian box with elderly numpy, where it does behave
>> sensibly:
>>
>> $> python -c "import numpy; print('numpy version: ', numpy.__version__);
>> a=2; b=-2;  print(pow(a,b)); print(pow(numpy.array(a), b))"
>> ('numpy version: ', '1.8.2')
>> 0.25
>> 0
>>
>> what am I missing?
>>
>>
2 ** -2 should be 0.25.

On old versions of NumPy, you see the incorrect answer 0. We now prefer to
raise an error rather than return the wrong answer.


>> > The pandas test suite triggered this behavior, but not intentionally,
>> > and should be fixed in the next release:
>> > https://github.com/pandas-dev/pandas/pull/14498
>>
>> I don't think that was the full set of cases, e.g.
>>
>> (git)hopa/sid-i386:~exppsy/pandas[bf-i386]
>> $> nosetests -s -v pandas/tests/test_expressions.py:TestExpressions.test_mixed_arithmetic_series
>> test_mixed_arithmetic_series (pandas.tests.test_expressions.TestExpressions) ... ERROR
>>
>> ======================================================================
>> ERROR: test_mixed_arithmetic_series (pandas.tests.test_expressions.TestExpressions)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/tests/test_expressions.py", line 223, in test_mixed_arithmetic_series
>>     self.run_series(self.mixed2[col], self.mixed2[col], binary_comp=4)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/tests/test_expressions.py", line 164, in run_series
>>     test_flex=False, **kwargs)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/tests/test_expressions.py", line 93, in run_arithmetic_test
>>     expected = op(df, other)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/core/ops.py", line 715, in wrapper
>>     result = wrap_results(safe_na_op(lvalues, rvalues))
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/core/ops.py", line 676, in safe_na_op
>>     return na_op(lvalues, rvalues)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/core/ops.py", line 652, in na_op
>>     raise_on_error=True, **eval_kwargs)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/computation/expressions.py", line 210, in evaluate
>>     **eval_kwargs)
>>   File "/home/yoh/deb/gits/pkg-exppsy/pandas/pandas/computation/expressions.py", line 63, in _evaluate_standard
>>     return op(a, b)
>> ValueError: Integers to negative integer powers are not allowed.
>>
>
Agreed, it looks like pandas still has this issue in its test suite.
Nonetheless, I don't think this should be an issue for users -- pandas
defers all handling of arithmetic to numpy.


Re: [Numpy-discussion] numpy vs algebra Was: Integers to negative integer powers...

2017-01-03 Thread Stephan Hoyer
On Tue, Jan 3, 2017 at 9:00 AM, Yaroslav Halchenko 
wrote:

> Sorry for coming too late to the discussion, and after the PR "addressing"
> the issue by raising an error was merged [1].  I got burnt by the new
> behavior while trying to build the fresh pandas release on Debian (we are
> freezing for release way too soon ;) ) -- some pandas tests failed since
> they rely on the previous non-erroring behavior, and we now have numpy
> 1.12.0~b1, which includes [1], in unstable/testing (candidate release).
>
> I quickly glanced over the discussion but I guess I have missed the
> actual description of the problem being fixed here...  what was it??
>
> The previous behavior, int**int->int, made sense to me as it seemed
> consistent with casting Python's pow result to int, somewhat fulfilling the
> desired promise for in-place operations and being in line with built-in
> pow results as far as I see it (up to casting).


I believe this is exactly the behavior we preserved. Rather, we turned some
cases that previously often gave wrong results (involving negative integer
powers) into errors.

The pandas test suite triggered this behavior, but not intentionally, and
should be fixed in the next release:
https://github.com/pandas-dev/pandas/pull/14498


Re: [Numpy-discussion] in1d, but preserve shape of ar1

2016-12-19 Thread Stephan Hoyer
I think this is a great idea!

I agree that we need a new function. Because the new API is almost strictly
superior, we should try to pick a more general name that we can encourage
users to switch to from in1d.

Pandas calls this method "isin", which I think is a perfectly good name for
the multi-dimensional NumPy version, too:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html

It's a subjective call, but I would probably keep the new function in
arraysetops.py. (This is the sort of question well suited to GitHub rather
than the mailing list, though.)
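
For concreteness, a sketch of the proposed function (the name isin follows
the pandas precedent; this is an illustration, not a merged NumPy API):

import numpy as np

def isin(element, test_elements, **kwargs):
    # Same membership test as in1d, but preserving element's shape.
    element = np.asarray(element)
    return np.in1d(element, test_elements, **kwargs).reshape(element.shape)

a = np.array([[1, 2], [3, 4]])
isin(a, [2, 3])
# array([[False,  True],
#        [ True, False]])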


On Mon, Dec 19, 2016 at 3:25 PM, Brenton R S Recht 
wrote:

> I started an enhancement request in the Github bug tracker at
> https://github.com/numpy/numpy/issues/8331 , but Jaime Frio recommended I
> bring it to the mailing list.
>
> `in1d` takes two arrays, `ar1` and `ar2`, and returns a 1d array with the
> same number of elements as `ar1`. The logical extension would be a function
> that does the same thing but returns a (possibly multi-dimensional) array
> of the same shape as `ar1`. The code already has a comment suggesting this
> could be done (see
> https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L444 ).
>
> I agree that changing the behavior of the existing function isn't an
> option, since it would break backwards compatibility. I'm not sure adding
> an option keep_shape is good, since the name of the function ("1d")
> wouldn't match what it does (returns an array that might not be 1d). I
> think a new function is the way to go. This would be it, more or less:
>
> def items_in(ar1, ar2, **kwargs):
>     return np.in1d(ar1, ar2, **kwargs).reshape(ar1.shape)
>
> Questions I have are:
> * Function name? I was thinking something like `items_in` or `item_in`:
> the function returns whether each item in `ar1` is in `ar2`. Is "item" or
> "element" the right term here?
> * Are there any other changes that need to happen in arraysetops.py? Or
> other files? I ask this because although the file says "Set operations for
> 1D numeric arrays" right at the top, it's growing increasingly not 1D:
> `unique` recently changed to operate on multidimensional arrays, and I'm
> proposing a multidimensional version of `in1d`. `ediff1d` could probably be
> tweaked into a version that operates along an axis the same way unique does
> now, fwiw. Mostly I want to know if I should put my code changes in this
> file or somewhere else.
>
> Thanks,
>
> -brsr
>


Re: [Numpy-discussion] ufunc for sum of squared difference

2016-11-14 Thread Stephan Hoyer
On Mon, Nov 14, 2016 at 5:40 PM, Matthew Harrigan <
harrigan.matt...@gmail.com> wrote:

> Essentially it creates a reduce for a function which isn't binary.  I
> think this would be generally useful.
>

NumPy already has a generic enough interface for creating such ufuncs. In
fact, it's called a "generalized ufunc":
https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html

I think you could already write "implicit reductions" using gufuncs?
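
As a sketch of that direction, here is the sum-of-squared-differences example
written with numba.guvectorize, which exists today (my illustration, not the
original poster's code):

import numpy as np
from numba import guvectorize

# Signature (n),(n)->(): the "reduction" is simply a core dimension
# that maps to a scalar output.
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->()')
def ssd(x, y, out):
    acc = 0.0
    for i in range(x.shape[0]):
        d = x[i] - y[i]
        acc += d * d
    out[0] = acc

a = np.random.rand(5, 100)
b = np.random.rand(5, 100)
ssd(a, b)  # shape (5,): one reduction per row, broadcasting over loop dims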


Re: [Numpy-discussion] array comprehension

2016-11-04 Thread Stephan Hoyer
On Fri, Nov 4, 2016 at 10:24 AM, Nathaniel Smith  wrote:

> Are you sure fromiter doesn't make an intermediate list or equivalent? It
> has to collect all the values before it can know the shape or dtype of the
> array to put them in.
>
fromiter dynamically resizes a NumPy array, like a Python list, except with
a growth factor of 1.5 (rather than 1.25):
https://github.com/numpy/numpy/blob/bb59409abf5237c155a1dc4c4d5b31e4acf32fbe/numpy/core/src/multiarray/ctors.c#L3721
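
For reference, a minimal example (mine); passing count skips the resizing
entirely by pre-allocating the exact length:

import numpy as np

squares = np.fromiter((i * i for i in range(10)), dtype=np.int64)
squares = np.fromiter((i * i for i in range(10)), dtype=np.int64, count=10)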


Re: [Numpy-discussion] array comprehension

2016-11-04 Thread Stephan Hoyer
On Fri, Nov 4, 2016 at 7:12 AM, Francesc Alted  wrote:

> Does this generalize to >1 dimensions?
>>
>
> A reshape() is not enough?  What do you want to do exactly?
>

np.fromiter takes scalar input and only builds a 1D array. So it actually
can't combine multiple values at once unless they are flattened out in
Python. It could be nice to add support for non-scalar inputs, stacking
them similarly to np.array. Likewise, it could be nice to add an axis
argument, so it can work similarly to np.stack.

More generally, you might want to iterate and rebuild over arbitrary
dimension(s) of an array. Something like
np.stack([x for x in np.unstack(y, axis)], axis)

But, we also don't have an unstack function. This would mostly be syntactic
sugar, but I think it would be a nice addition. Such a function actually
exists in TensorFlow:
https://g3doc.corp.google.com/third_party/tensorflow/g3doc/api_docs/python/array_ops.md?cl=head#unstack
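
A minimal sketch of such an unstack (my illustration; np.split plus squeeze
emulates the missing function):

import numpy as np

def unstack(arr, axis=0):
    # Split along `axis` into single-slice pieces, then drop that axis.
    arr = np.asarray(arr)
    return [np.squeeze(piece, axis=axis)
            for piece in np.split(arr, arr.shape[axis], axis=axis)]

y = np.arange(6).reshape(2, 3)
assert (np.stack(unstack(y, axis=1), axis=1) == y).all()  # round-trips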


Re: [Numpy-discussion] __numpy_ufunc__

2016-10-31 Thread Stephan Hoyer
Recall that I think we wanted to rename this to __array_ufunc__, so we
could change the function signature:
https://github.com/numpy/numpy/issues/5986

I'm still a little nervous about this. Chuck -- what is your proposal for
resolving the outstanding issues from
https://github.com/numpy/numpy/issues/5844?

On Mon, Oct 31, 2016 at 10:31 AM, Charles R Harris <
charlesr.har...@gmail.com> wrote:

>
>
> On Mon, Oct 31, 2016 at 11:08 AM, Marten van Kerkwijk <
> m.h.vankerkw...@gmail.com> wrote:
>
>> Hi Chuck,
>>
>> I've revived my Quantity PRs that use __numpy_ufunc__ but is it
>> correct that at present in *dev, one cannot use it?
>>
>
> It's not enabled yet.
>
> Chuck
>


Re: [Numpy-discussion] __numpy_ufunc__

2016-10-29 Thread Stephan Hoyer
I'm happy to revisit the __numpy_ufunc__ discussion (I still want to see it
happen!), but I don't recall scalars being a point of contention.

The obvious thing to do with scalars would be to treat them the same as
0-dimensional arrays, though I might be missing some nuance...

On Sat, Oct 29, 2016 at 6:02 AM, Charles R Harris  wrote:

> Hi All,
>
> Does anyone remember discussion of numpy scalars apropos __numpy_ufunc__?
>
> Chuck
>


Re: [Numpy-discussion] Combining covariance and correlation coefficient into one numpy.cov call

2016-10-26 Thread Stephan Hoyer
On Wed, Oct 26, 2016 at 11:03 AM, Mathew S. Madhavacheril <
mathewsyr...@gmail.com> wrote:

> On Wed, Oct 26, 2016 at 1:46 PM, Stephan Hoyer <sho...@gmail.com> wrote:
>
>> I wonder if the goals of this addition could be achieved by simply adding
>> an optional `cov` argument to np.corr, which would provide a pre-computed
>> covariance.
>>
>
> That's a fair suggestion which I'm happy to switch to. This eliminates the
> need for two new functions.
> I'll add an optional `cov = False` argument to numpy.corrcoef that returns
> a tuple (corr, cov) instead.
>
>
>>
>> Either way, `covcorr` feels like a helper function that could exist in
>> user code rather than numpy proper.
>>
>
> The user would have to re-implement the part that converts the covariance
> matrix to a correlation
> coefficient. I made this PR to avoid that code duplication.
>

With the API I was envisioning (or even your proposed API, for that
matter), this function would only be a few lines, e.g.,

def covcorr(x):
    cov = np.cov(x)
    corr = np.corrcoef(x, cov=cov)
    return (cov, corr)

Generally, functions this short should be provided as recipes (if at all)
rather than be added to numpy proper, unless the need for them is extremely
common.


Re: [Numpy-discussion] Combining covariance and correlation coefficient into one numpy.cov call

2016-10-26 Thread Stephan Hoyer
I wonder if the goals of this addition could be achieved by simply adding
an optional `cov` argument to np.corr, which would provide a pre-computed
covariance.

Either way, `covcorr` feels like a helper function that could exist in user
code rather than numpy proper.
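
For reference, the conversion that covtocorr would encapsulate is only a few
lines (my sketch, not the PR's code):

import numpy as np

def cov_to_corr(cov):
    # Divide by the outer product of the standard deviations.
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)

x = np.random.rand(3, 100)
assert np.allclose(cov_to_corr(np.cov(x)), np.corrcoef(x))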

On Wed, Oct 26, 2016 at 10:27 AM, Mathew S. Madhavacheril <
mathewsyr...@gmail.com> wrote:

> Hi all,
>
> I posted a pull request:
> https://github.com/numpy/numpy/pull/8211
>
> which adds a function `numpy.covcorr` that calculates both
> the covariance matrix and correlation coefficient with a single
> call to `numpy.cov` (which is often an expensive call for large
> data-sets). A function `numpy.covtocorr` has also been added
> that converts a covariance matrix to a correlation coefficient,
> and `numpy.corrcoef` has been modified to call this. The
> motivation here is that one often needs the covariance for
> subsequent analysis and the correlation coefficient for
> visualization, so instead of forcing the user to write their own
> code to convert one to the other, we want to allow both to
> be obtained from `numpy` as efficiently as possible.
>
> Best,
> Mathew
>
>


Re: [Numpy-discussion] Preserving NumPy views when pickling

2016-10-25 Thread Stephan Hoyer
On Tue, Oct 25, 2016 at 1:07 PM, Nathaniel Smith  wrote:

> Concretely, what do would you suggest should happen with:
>
> base = np.zeros(100)
> view = base[:10]
>
> # case 1
> pickle.dump(view, file)
>
> # case 2
> pickle.dump(base, file)
> pickle.dump(view, file)
>
> # case 3
> pickle.dump(view, file)
> pickle.dump(base, file)
>
> ?
>

I see what you're getting at here. We would need a rule for when to include
the base in the pickle and when not to. Otherwise, pickle.dump(view, file)
would always include the data from base, even when view is much smaller
than base.

The safe answer is "only use views in the pickle when base is already being
pickled", but that isn't possible to check unless all the arrays are
together in a custom container. So, this isn't really feasible for NumPy.


[Numpy-discussion] Preserving NumPy views when pickling

2016-10-25 Thread Stephan Hoyer
With a custom wrapper class, it's possible to preserve NumPy views when
pickling:
https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling

This can result in significant time/space savings with pickling views along
with base arrays and brings the behavior of NumPy more in line with Python
proper. Is this something that we can/should port into NumPy itself?
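
The core trick from that recipe, stripped of the pickle wiring (my sketch; a
real implementation needs care with writeability and exotic layouts):

import numpy as np

def view_state(view):
    # Record the base plus enough metadata to rebuild the view onto it.
    base = view.base
    offset = (view.__array_interface__['data'][0]
              - base.__array_interface__['data'][0])
    return base, offset, view.shape, view.strides

def rebuild_view(base, offset, shape, strides):
    return np.ndarray(shape, dtype=base.dtype, buffer=base,
                      offset=offset, strides=strides)

a = np.arange(10)
v = a[2:5]
assert np.shares_memory(a, rebuild_view(*view_state(v)))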


Re: [Numpy-discussion] padding options for diff

2016-10-24 Thread Stephan Hoyer
This looks like a welcome addition of functionality! It will be nice to be
able to finally (soft) deprecate ediff1d.
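
For reference, the keywords as they behave in ediff1d today, which the pull
request ports over to diff:

import numpy as np

x = np.array([1, 4, 9, 16])
np.ediff1d(x, to_begin=0, to_end=99)
# array([ 0,  3,  5,  7, 99])
# The PR's equivalent (hypothetical until merged):
# np.diff(x, to_begin=0, to_end=99)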

On Mon, Oct 24, 2016 at 5:44 AM, Matthew Harrigan <
harrigan.matt...@gmail.com> wrote:

> I posted a pull request which adds optional padding kwargs "to_begin" and
> "to_end" to diff.  Those options are based on what's available in ediff1d.
> It closes this issue.
>


Re: [Numpy-discussion] how to name "contagious" keyword in np.ma.convolve

2016-10-18 Thread Stephan Hoyer
On Tue, Oct 18, 2016 at 4:18 PM, Allan Haldane 
wrote:

> As for whether it should default to "True" or "False", the arguments I
> see are:
>
>  * False, because that is the way most functions like `np.ma.sum`
>already work, as well as matlab and octave's similar "nanconv".
>
>  * True, because its effects are more visible and might lead to less
>surprises. The "False" case seems like it is often not what the user
>intended. Eg, it affects the overall normalization of normalized
>kernels, and the choice of 0 seems arbitrary.
>
> If no one says anything, I'd probably go with True
>

I also have serious concerns about whether it ever actually makes sense to
use `propagate_mask=False`.

So, I think it's definitely appropriate to default to `propagate_mask=True`.


Re: [Numpy-discussion] Integers to negative integer powers, time for a decision.

2016-10-09 Thread Stephan Hoyer
On Sun, Oct 9, 2016 at 6:25 AM, Sebastian Berg 
wrote:

> For what it's worth, I still feel it is probably the only real option to
> go with error, changing to float may have weird effects. Which does not
> mean it is impossible, I admit, though I would like some data on how
> downstream would handle it. Also would we need an int power? The fpower
> seems more straight forward/common pattern.
> If errors turned out annoying in some cases, a seterr might be
> plausible too (as well as a deprecation).
>

I agree with Sebastian and Nathaniel. I don't think we can deviate from
the existing behavior (int ** int -> int) without breaking lots of existing
code, and if we did, yes, we would need a new integer power function.

I think it's better to preserve the existing behavior when it gives
sensible results, and error when it doesn't. Adding another function
float_power for the case that is currently broken seems like the right way
to go.
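
For reference, the behavior float_power provides (assuming NumPy >= 1.12,
where it landed):

import numpy as np

np.float_power(2, -2)  # 0.25: always computed in floating point
np.power(2.0, -2)      # 0.25: float inputs were always fine
# np.power(2, -2)      # ValueError: ints to negative int powers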


Re: [Numpy-discussion] PR 8053 np.random.multinomial tolerance param

2016-09-26 Thread Stephan Hoyer
I would actually be just as happy to relax the tolerance here to 1e-8
always. I doubt this would catch any fewer bugs than the current default.
In contrast, adding new parameters adds cognitive overload for everyone
encountering the function.

Also, for your use case, note that tensorflow has its own function for
generating random values from a multinomial distribution:
https://www.tensorflow.org/versions/r0.10/api_docs/python/constant_op.html#multinomial

On Mon, Sep 26, 2016 at 11:52 AM, Alex Beloi  wrote:

> Hello,
>
>
>
> Pull Request: https://github.com/numpy/numpy/pull/8053
>
>
>
> I would like to expose a tolerance parameter for the function
> numpy.random.multinomial.
>
>
>
> The function `multinomial(n, pvals, size=None)` correctly raises an exception
> when `sum(pvals) > 1 + 1e-12` as these values should sum to 1. However,
> other libraries often cannot or do not guarantee such level of precision.
>
>
>
> Specifically, I have encountered issues with tensorflow function
> tf.nn.softmax, which is expected to output a tensor whose values sum to 1,
> but often with precision of only 1e-8.
>
>
>
> I propose to expose the `1e-12` tolerance to a non-negative float
> parameter with default value `1e-12`.
>
>
>
> Alex
>


Re: [Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-26 Thread Stephan Hoyer
I have put a pull request implementing numpy.guvectorize up for review:
https://github.com/numpy/numpy/pull/8054

Cheers,
Stephan

On Tue, Sep 13, 2016 at 10:54 PM, Travis Oliphant <tra...@continuum.io>
wrote:

> There has been some discussion on the Numba mailing list as well about a
> version of guvectorize that doesn't compile for testing and flexibility.
>
> Having this be inside NumPy itself seems ideal.
>
> -Travis
>
>
> On Tue, Sep 13, 2016 at 12:59 PM, Stephan Hoyer <sho...@gmail.com> wrote:
>
>> On Tue, Sep 13, 2016 at 10:39 AM, Nathan Goldbaum <nathan12...@gmail.com>
>> wrote:
>>
>>> I'm curious whether you have a plan to deal with the python functional
>>> call overhead. Numba gets around this by JIT-compiling python functions -
>>> is there something analogous you can do in NumPy or will this always be
>>> limited by the overhead of repeatedly calling a Python implementation of
>>> the "core" operation?
>>>
>>
>> I don't think there is any way to avoid this in NumPy proper, but that's
>> OK (it's similar to the existing overhead of vectorize).
>>
>> Numba already has guvectorize (and it's own version of vectorize as
>> well), which already does exactly this.
>>
>
>
> --
>
> *Travis Oliphant, PhD*
> *Co-founder and CEO*
>
>
> @teoliphant
> 512-222-5440
> http://www.continuum.io
>


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Stephan Hoyer
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova 
wrote:

> Whenever we repr an array using 'S', we can instead show a unicode in py3.
> That
> keeps the binary representation, but will always show the expected result
> to
> users, and it's only a handful of lines added to dump_data().
>
> If needed, I could easily add a bytes array to make the alternative
> explicit
> (where py3 would repr the contents as b'foo').
>
> This would only leave the less-common paths inconsistent across python
> versions,
> which should not be a problem for most examples/doctests:
>
> * A 'U' array will show u'foo' in py2 and 'foo' in py3.
> * The new binary array will show 'foo' in py2 and b'foo' in py3 (that
> could also
>   be patched on the repr code).
> * A 'O' array will not be able to do any meaningful repr conversions.
>
>
> A more complex alternative (and actually closer to what I'm proposing) is
> to
> modify numpy in py3 to restrict 'S' to using 8-bit code points in a unicode
> string. It would have the binary compatibility, while being a unicode
> string in
> practice.


I'm afraid these are both also non-starters at this point. NumPy's string
dtype corresponds to bytes on Python 3, and you can use it to store
arbitrary binary values. Would it really be an improvement to change the
repr, if the scalar value resulting from indexing is still bytes?

The sanest approach is probably a new dtype for one-byte strings. We talked
about this a few years ago, but nobody has implemented it yet:
http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

(normally I would link to the archives on scipy.org, but the certificate
for HTTPS has expired so you see a big error message right now...)


Re: [Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Stephan Hoyer
On Tue, Sep 13, 2016 at 10:39 AM, Nathan Goldbaum 
wrote:

> I'm curious whether you have a plan to deal with the python functional
> call overhead. Numba gets around this by JIT-compiling python functions -
> is there something analogous you can do in NumPy or will this always be
> limited by the overhead of repeatedly calling a Python implementation of
> the "core" operation?
>

I don't think there is any way to avoid this in NumPy proper, but that's OK
(it's similar to the existing overhead of vectorize).

Numba already has guvectorize (and it's own version of vectorize as well),
which already does exactly this.


[Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Stephan Hoyer
NumPy has the handy np.vectorize for turning Python code that operates on
scalars into a vectorized function that works like a ufunc, but no helper
function for creating generalized ufuncs (
http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html).

np.apply_along_axis accomplishes some of this, but it only allows a single
core dimension on a single argument.

So I propose adding a new object, np.guvectorize(pyfunc, signature, otypes,
...), where pyfunc is defined over the core dimensions only of any inputs
and signature is any valid gufunc signature (a string). Calling this object
would apply the gufunc. This is inspired by the similar numba.guvectorize,
which is currently the easiest way to write a gufunc in Python.
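
A sketch of how the proposed object might be used (hypothetical API mirrored
from numba.guvectorize; np.guvectorize itself does not exist, which is the
point of the proposal):

import numpy as np

def euclidean(a, b):
    # The pyfunc sees only the core dimension.
    return np.sqrt(((a - b) ** 2).sum())

# dist = np.guvectorize(euclidean, '(n),(n)->()', otypes=[np.float64])
# dist(np.random.rand(5, 3), np.random.rand(5, 3))  # -> shape (5,)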

In addition to being handy like vectorize, such functionality would be
especially useful for working with libraries that build upon NumPy to
extend the capabilities of generalized ufuncs (e.g., xarray after
https://github.com/pydata/xarray/pull/964).

Cheers,
Stephan


Re: [Numpy-discussion] New Indexing Methods Revival #N (subclasses!)

2016-09-06 Thread Stephan Hoyer
On Mon, Sep 5, 2016 at 6:02 PM, Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> p.s. Just to be clear: personally, I think we should have neither
> `__numpy_getitem__` nor a mixin; we should just get the quite
> wonderful new indexing methods!


+1

I don't maintain ndarray subclasses (I prefer composition), but I don't
think it's too difficult to require implementing vindex and oindex
properties from scratch.

Side note: I would prefer the more verbose "legacy_index" to "lindex". We
really want to discourage this one, and two new abbreviations are bad
enough.


Re: [Numpy-discussion] Views and Increments

2016-08-08 Thread Stephan Hoyer
On Mon, Aug 8, 2016 at 6:11 AM, Anakim Border  wrote:

> Alternative version:
>
> >>> a = np.arange(10)
> >>> a[np.array([1,6,5])] += 1
> >>> a
> array([0, 2, 2, 3, 4, 6, 7, 7, 8, 9])
>

I haven't checked, but a likely explanation is that Python itself
interprets a[b] += c as a[b] = a[b] + c.

Python has special methods for item assignment (__setitem__) and inplace
arithmetic (__iadd__), but no special method for inplace arithmetic and
assignment at the same time, so this is really out of NumPy's control.
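
As a side note, when the index array contains repeats and you want every
occurrence applied, NumPy's unbuffered np.add.at does that (my example):

import numpy as np

a = np.zeros(4, dtype=int)
idx = np.array([0, 0, 1])
a[idx] += 1           # buffered: a is now [1, 1, 0, 0]
np.add.at(a, idx, 1)  # unbuffered: a is now [3, 2, 0, 0]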


Re: [Numpy-discussion] ufunc reduceat behavior on empty slices

2016-07-29 Thread Stephan Hoyer
Jaime brought up the same issue recently, along with some other issues for
ufunc.reduceat:
https://mail.scipy.org/pipermail/numpy-discussion/2016-March/075199.html

I completely agree with both of you that the current behavior for empty
slices is very strange and should be changed to remove the special case.
Nathaniel Smith voiced the same opinion on the GitHub issue [1].

I think the path forward here (as Nathaniel writes) is pretty clear:
1. Start issuing a FutureWarning about a future change.
2. Fix this in a release or two.

[1] https://github.com/numpy/numpy/issues/834

On Fri, Jul 29, 2016 at 11:42 AM, Erik Brinkman 
wrote:

> Hi,
>
> The behavior of a ufuncs reduceat on empty slices seems a little strange,
> and I wonder if there's a reason behind it / if there's a route to
> potentially changing it. First, I'll go into a little background.
>
> I've been making a lot of use of ufuncs reduceat functionality on
> staggered arrays. In general, I'll have "n" arrays, each with size "s[n]"
> and I'll store them in one array "x", such that "s.sum() == x.size".
> reduceat is great because I use
>
> ufunc.reduceat(x, np.insert(s[:-1].cumsum(), 0, 0))
>
> to get some summary information about each array. However, reduceat seems
> to behave strangely for empty slices. To make things concrete, let's assume:
>
> import numpy as np
> s = np.array([3, 0, 2])
> x = np.arange(s.sum())
> inds = np.insert(s[:-1].cumsum(), 0, 0)
> # [0, 3, 3]
> np.add.reduceat(x, inds)
> # [3, 3, 7] not [3, 0, 7]
> # This is distinct from
> np.fromiter(map(np.add.reduce, np.array_split(x, inds[1:])), x.dtype,
> s.size)
> # [3, 0, 7] what I wanted
>
> The current documentation on reduceat first states:
>
> For i in range(len(indices)), reduceat computes
> ufunc.reduce(a[indices[i]:indices[i+1]])
>
> That would suggest the outcome, that I expected. However, in the examples
> section it goes into a bunch of examples which contradict that statement
> and instead suggest that the actual algorithm is more akin to:
>
> ufunc.reduce(a[indices[i]:indices[i+1]]) if indices[i+1] > indices[i]
> else a[indices[i]]
>
> Looking at the source, it seems like it's copying a[indices[i]], and then
> while there are more
> elements to process it keeps reducing, resulting in this unexpected
> behavior. It seems like the proper thing to do would be start with
> ufunc.identity, and then reduce. This is slightly less performant than
> what's implemented, but more "correct." There could, of course, just be a
> switch to copy the identity only when the slice is empty.
>
> Is there a reason it's implemented like this? Is it just for performance,
> or is this strange behavior *useful* somewhere? It seems like "fixing"
> this would be bad because you'll be changing somewhat documented
> functionality in a backwards incompatible way. What would the best approach
> to "fixing" this be? add another function "reduceat_"? add a flag to
> reduceat to do the proper thing for empty slices?
>
> Finally, is there a good way to work around this? I think for now I'm just
> going to mask out the empty slices and use insert to add them back in, but
> if I'm missing an obvious solution, I'll look at that too. I need to mask
> them out because, np.add.reduceat(x, 5) would ideally return 0, but
> instead it throws an error since 5 is out of range...
>
> Thanks for indulging my curiosity,
> Erik
>


Re: [Numpy-discussion] Is there any official position on PEP484/mypy?

2016-07-29 Thread Stephan Hoyer
I'm a big fan of type annotations and would support moving your repo over
to the official typeshed repo or the NumPy GitHub organization to indicate
its official status. This is excellent work -- thank you for putting in
the effort!

Like Ben, I have also wished for type annotation support for dimension
shapes/sizes. Someone recently suggested using datashape as potential
syntax for this on the Blaze mailing list [1]. I have no idea how hard it
would be to actually implement type inference for shape. Possibly an
interesting research project? I know it's out of scope for mypy / PEP 484
for now.

[1]
https://groups.google.com/a/continuum.io/forum/#!topic/blaze-dev/0vNo4f-tNSk

On Fri, Jul 29, 2016 at 9:31 AM, Daniel Moisset 
wrote:

> I don't think a tool like mypy nor PEP 484 can talk about specific sizes
> (like the MxN and NxP for a matrix multiplication), but probably there are
> things that can be done at least about dimensionality (saying "a and b are
> 2d matrices, v is a 1-d vector"). But that's much farther down the road.
> For now you'll be able to detect simpler errors like treating an ndarray as
> a python list, misspelled method names, or wrong counts/order of method
> arguments.
>
> Best,
>D.
>
> On Fri, Jul 29, 2016 at 2:31 PM, Benjamin Root 
> wrote:
>
>> One thing that I have always wished for from a project like mypy is the
>> ability to annotate what the expected shape should be. Too often, I get a
>> piece of code from a coworker and it comes with no docstring explaining the
>> expected dimensions of the input arrays and what the output array is going
>> to be. What would be really awesome is the ability to do something like
>> annotate that "a" is MxN, and "b" is NxP, and that "c" is Px3. Even if the
>> linter can't really check to make sure that the shapes would be respected,
>> it would still be nice to have a common way of expressing the expected
>> shapes in this annotation format.
>>
>> As for matplotlib, we would need to express much more complicated
>> annotations, because our API is so flexible. It would be useful to keep an
>> eye out to those needs as well.
>>
>> Cheers!
>> Ben Root
>>
>>
>> On Fri, Jul 29, 2016 at 5:33 AM, Daniel Moisset 
>> wrote:
>>
>>> Hi Sebastian, thanks for your reply
>>>
>>> I'm glad to hear that you see value in having type annotations. Just to
>>> clarify, my question was aimed at surveying if there was interest in
>>> accepting the work we're already doing if we contribute it and if it has
>>> value for the numpy project. I'm aware there's effort involved; some
>>> colleagues and me are already involved doing that at
>>> https://github.com/machinalis/mypy-data because it's valuable for
>>> ourselves, so the volunteers are already here. You of course are invited to
>>> comment on the existing code and try it :) (or joining the effort, goes
>>> without saying)
>>>
>>> Running the checker on the test suite is probably the best way to
>>> validate the annotations (the normal way would be checking the annotations
>>> against the code, but that doesn't work with C extensions like numpy).
>>> That's something we haven't been doing yet but it's an obvious next step
>>> now that some simple examples are working.
>>> WRT "I wonder if all or most of numpy can be easily put into it.",
>>> we've covered ndarray (and matrix soon) which are the core types, things
>>> built upon that shouldn't be too hard. We found some snags along the way
> [1] [2], but none of it is a showstopper and I'm quite sure we'll fix those
>>> in time. But of course, if someone wants to try it out it will be a better
>>> validation than my optimism to see if this makes sense :)
>>>
>>> Thanks again and I'd be happy to hear more opinions from other numpy
>>> devs!
>>>
>>> Best,
>>>D.
>>>
>>> [1] http://www.machinalis.com/blog/writing-type-stubs-for-numpy/
>>> [2] https://github.com/machinalis/mypy-data/issues
>>>
>>>
>>> On 29 Jul 2016 08:31, "Sebastian Berg" 
>>> wrote:
>>>
 On Mi, 2016-07-27 at 20:07 +0100, Daniel Moisset wrote:
 >
 > Hi,
 >
 > I work at Machinalis were we use a lot of numpy (and the pydata stack
 > in general). Recently we've also been getting involved with mypy,
 > which is a tool to type check (not on runtime, think of it as a
 > linter) annotated python code (the way of annotating python types has
 > been recently standarized in PEP 484).
 >
 > As part of that involvement we've started creating type annotations
 > for the Python libraries we use most, which include numpy. Mypy
 > provides a way to specify types with annotations in separate files in
 > case you don't have control over a library, so we have created an
 > initial proof of concept at [1], and we are actively improving it.
 > You can find some additional information about it and some problems
 > we've found on the way at this 

Re: [Numpy-discussion] isnan() equivalent for np.NaT?

2016-07-18 Thread Stephan Hoyer
Agreed -- this would be really nice to have. For now, the best you can do
is something like the following:


def is_na(x):
    x = np.asarray(x)
    if np.issubdtype(x.dtype, (np.datetime64, np.timedelta64)):  # ugh
        int_min = np.iinfo(np.int64).min
        return x.view('int64') == int_min
    else:
        return np.isnan(x)


On Mon, Jul 18, 2016 at 3:39 PM, Gerrit Holl  wrote:

> On 18 July 2016 at 22:20, Scott Sanderson 
> wrote:
> > I'm working on upgrading Zipline (github.com/quantopian/zipline) to the
> > latest numpy, and I'm getting FutureWarnings about the upcoming change in
> > the behavior of comparisons on np.NaT.  I'd like to be able to do checks
> > for NaT in a way that's forwards-compatible, but I couldn't find a
> > function analogous to `np.isnan` for NaT.  Am I missing something that
> > already exists?  If not, is there interest in such a function? I'd like
> > to be able to pre-emptively fix the warnings in Zipline so that we're
> > ready when the change actually happens, but it doesn't seem like the
> > necessary tools are available right now.
>
> Hi Scott,
>
> see https://github.com/numpy/numpy/issues/5610
>
> Gerrit.


Re: [Numpy-discussion] Added atleast_nd, request for clarification/cleanup of atleast_3d

2016-07-06 Thread Stephan Hoyer
On Tue, Jul 5, 2016 at 10:06 PM, Nathaniel Smith  wrote:

> I don't know how typical I am in this. But it does make me wonder if the
> atleast_* functions act as an attractive nuisance, where new users take
> their presence as an implicit recommendation that they are actually a
> useful thing to reach for, even though they... aren't that. And maybe we
> should be recommending folk move away from them rather than trying to
> extend them further?
>
Agreed. I would avoid adding atleast_nd. We could discourage using
atleast_3d (certainly the behavior is indeed surprising), but I'm not sure
it's worth the trouble.


Re: [Numpy-discussion] Datarray 0.1.0 release

2016-06-10 Thread Stephan Hoyer
On Fri, Jun 10, 2016 at 12:51 PM, Matthew Brett 
wrote:

> If you like the general idea, and you don't mind the pandas
> dependency, `xray` is a much better choice for production code right
> now, and will do the same stuff and more:
>
> https://pypi.python.org/pypi/xray/0.4.1
>
>

Hi Matthew,

Congrats on the release!

I just wanted to point out that "xray" is now known as "xarray":
https://pypi.python.org/pypi/xarray/

Cheers,
Stephan


Re: [Numpy-discussion] ENH: compute many inner products quickly

2016-06-06 Thread Stephan Hoyer
On Mon, Jun 6, 2016 at 3:32 PM, Jaime Fernández del Río <
jaime.f...@gmail.com> wrote:

> Since we are at it, should quadratic/bilinear forms get their own function
> too?  That is, after all, what the OP was asking for.
>

If we have matvecmul and vecmul, then how to implement bilinear forms
efficiently becomes pretty clear:
np.vecmul(b, np.matvecmul(A, b))

I'm not sure writing a dedicated function in numpy itself makes sense for
something this easy.

I suppose there would be some performance gains from not saving the
intermediate result, but I suspect this would be premature optimization in
most cases.


Re: [Numpy-discussion] ENH: compute many inner products quickly

2016-06-05 Thread Stephan Hoyer
On Sun, Jun 5, 2016 at 5:08 PM, Mark Daoust  wrote:

> Here's the einsum version:
>
> `es =  np.einsum('Na,ab,Nb->N',X,A,X)`
>
> But that's running ~45x slower than your version.
>
> OT: anyone know why einsum is so bad for this one?
>

I think einsum can create some large intermediate arrays. It certainly
doesn't always do multiplication in the optimal order:
https://github.com/numpy/numpy/pull/5488
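
For comparison, the manual contraction order that avoids the large
intermediates here (my sketch):

import numpy as np

X = np.random.rand(50, 40)
A = np.random.rand(40, 40)

# One matmul, then an elementwise multiply and a row sum: no
# (N, a, b)-sized temporaries.
es = ((X @ A) * X).sum(axis=1)
assert np.allclose(es, np.einsum('Na,ab,Nb->N', X, A, X))
# Newer NumPy can search for such orders itself (assuming numpy >= 1.12):
# np.einsum('Na,ab,Nb->N', X, A, X, optimize=True)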


Re: [Numpy-discussion] ENH: compute many inner products quickly

2016-06-05 Thread Stephan Hoyer
If possible, I'd love to add new functions for "generalized ufunc" linear
algebra, and then deprecate (or at least discourage) using the older
versions with inferior broadcasting rules. Adding a new keyword arg means
we'll be stuck with an awkward API for a long time to come.

There are three types of matrix/vector products for which ufuncs would be
nice:
1. matrix-matrix product (covered by matmul)
2. matrix-vector product
3. vector-vector (inner) product

It's straightforward to implement either of the latter two options by inserting
dummy dimensions and then calling matmul, but that's a pretty awkward API,
especially for inner products. Unfortunately, we already use the two most
obvious one word names for vector inner products (inner and dot). But on
the other hand, one word names are not very descriptive, and the short name
"dot" probably mostly exists because of the lack of an infix operator.

So I'll start by throwing out some potential new names:

For matrix-vector products:
matvecmul (if it's worth making a new operator)

For inner products:
vecmul (similar to matmul, but probably too ambiguous)
dot_product
inner_prod
inner_product





On Sat, May 28, 2016 at 8:53 PM, Scott Sievert 
wrote:

> I recently ran into an application where I had to compute many inner
> products quickly (roughy 50k inner products in less than a second). I
> wanted a vector of inner products over the 50k vectors, or `[x1.T @ A @ x1,
> …, xn.T @ A @ xn]` with A.shape = (1k, 1k).
>
> My first instinct was to look for a NumPy function to quickly compute
> this, such as np.inner. However, it looks like np.inner has some other
> behavior and I couldn’t get tensordot/einsum to work for me.
>
> Then a labmate pointed out that I can just do some slick matrix
> multiplication to compute the same quantity, `(X.T * A @ X.T).sum(axis=0)`.
> I opened [a PR] with this, and proposed that we define a new function
> called `inner_prods` for this.
>
> However, in the PR, @shoyer pointed out
>
> > The main challenge is to figure out how to transition the behavior of
> all these operations, while preserving backwards compatibility. Quite
> likely, we need to pick new names for these functions, though we should try
> to pick something that doesn't suggest that they are second class
> alternatives.
>
> Do we choose new function names? Do we add a keyword arg that changes what
> np.inner returns?
>
> [a PR]:https://github.com/numpy/numpy/pull/7690
>
>
>
>


Re: [Numpy-discussion] NumPy 1.11 docs

2016-05-30 Thread Stephan Hoyer
Awesome, thanks Ralf!
On Sun, May 29, 2016 at 1:13 AM Ralf Gommers <ralf.gomm...@gmail.com> wrote:

> On Sun, May 29, 2016 at 4:35 AM, Stephan Hoyer <sho...@gmail.com> wrote:
>
>> These still are missing from the SciPy.org page, several months after the
>> release.
>>
>
> Thanks Stephan, that needs fixing.
>
>
>>
>> What do we need to do to keep these updated?
>>
>
>
> https://github.com/numpy/numpy/blob/master/doc/HOWTO_RELEASE.rst.txt#update-docsscipyorg
>
>
>> Is there someone at Enthought we should ping? Or do we really just need
>> to transition to different infrastructure?
>>
>
> No, we just need to not forget:) The release manager normally does this,
> or he pings someone else to do it. At the moment Pauli, Julian, Evgeni and
> me have access to the server. I'll fix it up today.
>
> Ralf
>


[Numpy-discussion] NumPy 1.11 docs

2016-05-28 Thread Stephan Hoyer
These still are missing from the SciPy.org page, several months after the
release.

What do we need to do to keep these updated? Is there someone at Enthought
we should ping? Or do we really just need to transition to different
infrastructure?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers

2016-05-24 Thread Stephan Hoyer
On Tue, May 24, 2016 at 10:31 AM, Alan Isaac  wrote:

> Yes, but that one case is trivial: a*a


an_explicit_name ** 2 is much better than an_explicit_name *
an_explicit_name, though.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Integers to integer powers

2016-05-24 Thread Stephan Hoyer
On Tue, May 24, 2016 at 9:41 AM, Alan Isaac  wrote:

> What exactly is the argument against *always* returning float
> (even for positive integer exponents)?
>

If we were starting over from scratch, I would agree with you, but the int
** 2 example feels quite compelling to me. I would guess there lots of code
out there that expects the result to have integer dtype.

As a contrived example, I might write np.arange(n) ** 2 to produce an
indexer for the diagonal elements of an array.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-17 Thread Stephan Hoyer
On Tue, May 17, 2016 at 12:18 AM, Robert Kern <robert.k...@gmail.com> wrote:

> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer <sho...@gmail.com> wrote:
> > 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
>
> Can you clarify the use case here? I don't really know what you are doing
> here, but I'm pretty sure this is not the right approach.
>

Here's a contrived example. Suppose I've written a simulator for cars that
consists of a number of loosely connected components (e.g., an engine,
brakes, etc.). The behavior of each component of our simulator is
stochastic, but we want everything to be fully reproducible, so we need to
use seeds or RandomState objects.

We might write our simulate_car function like the following:

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    engine = simulate_engine(engine_config, seed=rs.random_seed())
    brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
    ...

The problem with passing the same RandomState object (either explicitly or
dropping the seed argument entirely and using the global state) to both
simulate_engine and simulate_brakes is that it breaks encapsulation -- if I
change what I do inside simulate_engine, it also affects the brakes.

The dask use case is actually pretty different -- the intent is to create
many random numbers in parallel using multiple threads or processes
(possibly in a distributed fashion). I know that skipping ahead is the
standard way to get independent number streams for parallel sampling, but
that isn't exposed in numpy.random, and setting distinct seeds seems like a
reasonable alternative for scientific computing use cases.


> It's only pseudo-private. This is an authorized use of it.
>
> However, for this case, I usually just pass around the numpy.random
> module itself and let duck-typing take care of the rest.
>

I like the duck-typing approach. That's very elegant.

If this is an authorized use of the global RandomState object, let's
document it! Otherwise cautious library maintainers like myself will
discourage using it :).


> > [3] On a side note, if there's no longer a good reason to keep this
> object private, perhaps we should expose it in our public API. It would
> certainly be useful -- scikit-learn is already using it (see links in the
> pandas PR above).
>
> Adding a public get_global_random_state() function might be in order.
>

Yes, possibly.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-16 Thread Stephan Hoyer
Looking at the dask helper function again reminds me of an important caveat
to this approach, which was pointed out to me by Clark Fitzgerald.

If you generate a moderately large number of random seeds in this fashion,
you are quite likely to have collisions due to the Birthday Paradox. For
example, you have a 50% chance of encountering at least one collision if
you generate only 77,000 seeds:
https://en.wikipedia.org/wiki/Birthday_attack
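
A back-of-the-envelope check of that figure, using the standard birthday
approximation (an illustration, not a rigorous analysis):

import numpy as np

# P(at least one collision) ~= 1 - exp(-n**2 / (2 * N)) for n draws from N values
n = 77000          # number of seeds drawn
N = 2.0 ** 32      # size of the uint32 seed space
print(1 - np.exp(-n ** 2 / (2 * N)))  # ~0.50 -- roughly a coin flip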

The docstring for this function should document this limitation of the
approach, which is still appropriate for a small number of seeds. Our
implementation can also encourage creating these seeds in a single
vectorized call to random_seed, which can significantly reduce the
likelihood of collisions between seeds generated in a single call to
random_seed with something like the following:

def random_seed(size):
    base = np.random.randint(2 ** 32)
    offset = np.arange(size)
    return (base + offset) % (2 ** 32)

In principle, I believe this could generate the full 2 ** 32 unique seeds
without any collisions.

Cryptography experts, please speak up if I'm mistaken here.

On Mon, May 16, 2016 at 8:54 PM, Stephan Hoyer <sho...@gmail.com> wrote:

> I have recently encountered several use cases for randomly generating random
> number seeds:
>
> 1. When writing a library of stochastic functions that take a seed as an
> input argument, and some of these functions call multiple other such
> stochastic functions. Dask is one such example [1].
>
> 2. When a library needs to produce results that are reproducible after
> calling numpy.random.seed, but does not want to use the functions in
> numpy.random directly. This came up recently in a pandas pull request [2],
> because we want to allow using RandomState objects as an alternative to
> global state in numpy.random. A major advantage of this approach is that it
> provides an obvious alternative to reusing the private numpy.random._mtrand
> [3].
>
> The implementation of this function (and the corresponding method on
> RandomState) is almost trivial, and I've already written such a utility for
> my code:
>
> def random_seed():
>     # numpy.random uses uint32 seeds
>     return np.random.randint(2 ** 32)
>
> The advantage of adding a new method is that it avoids the need for
> explanation by making the intent of code using this pattern obvious. So I
> think it is a good candidate for inclusion in numpy.random.
>
> Any opinions?
>
> [1]
> https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/dask/utils.py#L336
> [2] https://github.com/pydata/pandas/pull/13161
> [3] On a side note, if there's no longer a good reason to keep this object
> private, perhaps we should expose it in our public API. It would certainly
> be useful -- scikit-learn is already using it (see links in the pandas PR
> above).
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Proposal: numpy.random.random_seed

2016-05-16 Thread Stephan Hoyer
I have recently encountered several use cases for randomly generating random
number seeds:

1. When writing a library of stochastic functions that take a seed as an
input argument, and some of these functions call multiple other such
stochastic functions. Dask is one such example [1].

2. When a library needs to produce results that are reproducible after
calling numpy.random.seed, but that do not want to use the functions in
numpy.random directly. This came up recently in a pandas pull request [2],
because we want to allow using RandomState objects as an alternative to
global state in numpy.random. A major advantage of this approach is that it
provides an obvious alternative to reusing the private numpy.random._mtrand
[3].

The implementation of this function (and the corresponding method on
RandomState) is almost trivial, and I've already written such a utility for
my code:

def random_seed():
    # numpy.random uses uint32 seeds
    return np.random.randint(2 ** 32)

The advantage of adding a new method is that it avoids the need for
explanation by making the intent of code using this pattern obvious. So I
think it is a good candidate for inclusion in numpy.random.

Any opinions?

[1]
https://github.com/dask/dask/blob/e0b246221957c4bd618e57246f3a7ccc8863c494/dask/utils.py#L336
[2] https://github.com/pydata/pandas/pull/13161
[3] On a side note, if there's no longer a good reason to keep this object
private, perhaps we should expose it in our public API. It would certainly
be useful -- scikit-learn is already using it (see links in the pandas PR
above).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changing the behavior of (builtins.)round (via the __round__ dunder) to return an integer

2016-04-13 Thread Stephan Hoyer
On Wed, Apr 13, 2016 at 8:06 AM,  wrote:
>
> The difference is that Python 3 has long ints, (and doesn't have to
> overflow, AFAICS)
>

This is a good point. But if your float is so big that rounding it to an
integer would overflow int64, rounding is already a no-op. I'm sure this
has been done before but I would guess it's quite rare. I would be OK
raising in this situation, especially because np.around will still be
around returning floats.


> what happens with nan?
> I guess inf would overflow?
>

builtins.round raises for both of these (in Python 3) and I would propose
copying this behavior:

In [52]: round(float('inf'))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-52> in <module>()
----> 1 round(float('inf'))

OverflowError: cannot convert float infinity to integer

In [53]: round(float('nan'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53> in <module>()
----> 1 round(float('nan'))

ValueError: cannot convert float NaN to integer
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changing the behavior of (builtins.)round (via the __round__ dunder) to return an integer

2016-04-13 Thread Stephan Hoyer
On Wed, Apr 13, 2016 at 12:42 AM, Antony Lee 
wrote:

> (Note that I am suggesting to switch to the new behavior regardless of the
> version of Python.)
>

I would lean towards making this change only for Python 3. This is arguably
more consistent with Python than changing the behavior on Python 2.7, too.

The most obvious way in which a float being surprisingly switched to an
integer could cause silent bugs (rather than noisy TypeErrors) is if the
number is used in division. True division in Python 3 eliminates this risk.

Generally, I agree with your reasoning. It would be unfortunate to be stuck
with this legacy behavior forever.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-04-11 Thread Stephan Hoyer
On Mon, Apr 11, 2016 at 5:39 AM, Matěj Týč  wrote:

> * ... I do see some value in providing a canonical right way to
> construct shared memory arrays in NumPy, but I'm not very happy with
> this solution, ... terrible code organization (with the global
> variables):
> * I understand that, however this is a pattern of Python
> multiprocessing and everybody who wants to use the Pool and shared
> data either is familiar with this approach or has to become familiar
> with[2, 3]. The good compromise is to have a separate module for each
> parallel calculation, so global variables are not a problem.
>

OK, we can agree to disagree on this one. I still don't think I could get
code using this pattern checked in at my work (for good reason).


> * If there's some way we can paper over the boilerplate such that
> users can use it without understanding the arcana of multiprocessing,
> then yes, that would be great. But otherwise I'm not sure there's
> anything to be gained by putting it in a library rather than referring
> users to the examples on StackOverflow [1] [2].
> * What about telling users: "You can use numpy with multiprocessing.
> Remember the multiprocessing.Value and multiprocessing.Array classes?
> numpy.shm works exactly the same way, which means that it shares their
> limitations. Refer to an example: ." Notice that
> although those SO links contain all of the information, it is very
> difficult to get it up and running for a newcomer like me few years
> ago.
>

I guess I'm still not convinced this is the best we can with the
multiprocessing library. If we're going to do this, then we definitely need
to have the fully canonical example.

For example, could you make the shared array a global variable and then
still pass references to functions called by the processes anyways? The
examples on stackoverflow that we're both looking at are varied enough that
it's not obvious to me that this is as good as it gets.
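
For reference, a minimal sketch of the shared-array-plus-Pool pattern under
discussion (the dtype, sizes, and helper names are all illustrative, and error
handling is elided):

import multiprocessing
import numpy as np

def init_worker(shared):
    # stash the shared buffer in a module-level global for the worker processes
    global shared_arr
    shared_arr = np.frombuffer(shared, dtype=np.float64)

def work(i):
    shared_arr[i] = i ** 2  # writes land in shared memory, not a per-process copy

if __name__ == '__main__':
    raw = multiprocessing.Array('d', 10, lock=False)
    with multiprocessing.Pool(2, initializer=init_worker, initargs=(raw,)) as pool:
        pool.map(work, range(10))
    print(np.frombuffer(raw, dtype=np.float64))  # [0, 1, 4, ..., 81]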

> * This needs tests and justification for custom pickling methods,
> which are not used in any of the current examples. ...
> * I am sorry, but don't fully understand that point. The custom
> pickling method of shmarray has to be there on Windows, but users
> don't have to know about it at all. As noted earlier, the global
> variable is the only way of using standard Python multiprocessing.Pool
> with shared objects.
>

That sounds like a fine justification, but given that it wasn't obvious, it
needs a comment saying as much in the source code :). Also, it breaks
pickle, which is another limitation that needs to be documented.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changes to generalized ufunc core dimension checking

2016-03-20 Thread Stephan Hoyer
On Thu, Mar 17, 2016 at 3:28 PM, Jaime Fernández del Río <
jaime.f...@gmail.com> wrote:

> Would the logic for such a thing be consistent? E.g. how do you decide if
> the user is requesting (k),(k)->(), or (k),()->() with broadcasting over a
> non-core dimension of size k in the second argument? What if your
> signatures are (m, k),(k)->(m) and (k),(n,k)->(n) and your two inputs are
> (m,k) and (n,k), how do you decide which one to call? Or alternatively, how
> do you detect and forbid such ambiguous designs? Figuring out the dispatch
> rules for the general case seems like a non-trivial problem to me.
>

I would require a priority order for the core signatures when the gufunc is
created and only allow one implementation per argument dimension in the
core signature (i.e., disallow multiple implementations like (k),(k)->()
and (k),(m)->()).

The rule would be to dispatch to the implementation with the first core
signature with the right number of axes. The latter constraint ensures that
(m,n) @ (k,n) errors if k != n, rather than attempting vectorized
matrix-vector multiplication. For matmul/@, the priority order is pretty
straightforward:
1. (m,n),(n,k)->(m,k)
2. (m,n),(n)->(m)
3. (m),(m,n)->(n)
4. (m),(m)->()

(2 and 3 could be safely interchanged.)

For scenarios like "(k),(k)->(), or (k),()->()", the only reasonable choice
would be to put (k),(k)->() first -- otherwise it never gets called. For
the other ambiguous case, "(m, k),(k)->(m) and (k),(n,k)->(n)", the
implementer would also need to pick an order.
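
To make the rule concrete, here is a toy sketch (all names hypothetical) of
dispatching on the number of core axes in priority order:

import numpy as np

# Hypothetical priority table for a matmul-like gufunc: pairs of core ranks,
# most specific first, each mapped to an implementation label.
PRIORITY = [
    ((2, 2), 'matrix-matrix'),
    ((2, 1), 'matrix-vector'),
    ((1, 2), 'vector-matrix'),
    ((1, 1), 'inner product'),
]

def dispatch(a, b):
    # Pick the first signature whose core ranks fit both inputs; any extra
    # leading axes become broadcast (loop) dimensions.
    for (n1, n2), impl in PRIORITY:
        if a.ndim >= n1 and b.ndim >= n2:
            return impl
    raise TypeError('no matching core signature')

print(dispatch(np.ones((10, 3, 4)), np.ones((4, 5))))  # matrix-matrix
print(dispatch(np.ones((10, 3, 4)), np.ones(4)))       # matrix-vector
print(dispatch(np.ones(4), np.ones(4)))                # inner product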

Most of the tricky cases for multiple dispatch arise from extensible
systems (e.g., Matthew Rocklin's multipledispatch library), where you
allow/encourage third party libraries to add their own implementations and
need to be sure the combined result is still consistent. I wouldn't suggest
such a system for NumPy -- I think it's fine to require every gufunc to
have a single owner. There are other solutions for allowing extensibility
to duck array types (namely, __numpy_ufunc__).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changes to generalized ufunc core dimension checking

2016-03-19 Thread Stephan Hoyer
On Thu, Mar 17, 2016 at 2:49 PM, Travis Oliphant 
wrote:

> That's a great idea!
>
> Adding multiple-dispatch capability for this case could also solve a lot
> of issues that right now prevent generalized ufuncs from being the
> mechanism of implementation of *all* NumPy functions.
>
> -Travis
>

For future reference, there's already an issue on GitHub about adding an
axis argument to gufuncs:
https://github.com/numpy/numpy/issues/5197

(see also the referenced mailing list discussion from that page.)
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] fromnumeric.py internal calls

2016-02-28 Thread Stephan Hoyer
I think this is an improvement, but I do wonder if there are libraries out
there that use *args instead of **kwargs to handle these extra arguments.
Perhaps it's worth testing this change against third party array libraries
that implement their own array like classes? Off the top of my head, maybe
scipy, pandas, dask, astropy, pint, xarray?
On Wed, Feb 24, 2016 at 3:40 AM G Young  wrote:

> Hello all,
>
> I have PR #7325  up that
> changes the internal calls for functions in *fromnumeric.py* from
> positional arguments to keyword arguments.  I made this change for two
> reasons:
>
> 1) It is consistent with the external function signature
> 2)
>
> The inconsistency caused a breakage in *pandas* in its own implementation
> of *searchsorted* in which the *sorter* argument is not really used but
> is accepted so as to make it easier for *numpy* users who may be used to
> the *searchsorted* signature in *numpy*.
>
> The standard in *pandas* is to "swallow" those unused arguments into a
> *kwargs* argument so that we don't have to document an argument that we
> don't really use.  However, that turned out not to be possible when
> *searchsorted* is called from the *numpy* library.
>
> Does anyone have any objections to the changes I made?
>
> Thanks!
>
> Greg
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Generalized flip function

2016-02-28 Thread Stephan Hoyer
I also think this is a good idea -- the generalized flip is much more
numpythonic than the specialized 2d versions.
On Fri, Feb 26, 2016 at 11:36 AM Joseph Fox-Rabinovitz <
jfoxrabinov...@gmail.com> wrote:

> If nothing else, this is a nice complement to the generalized `stack`
> function.
>
> -Joe
>
> On Fri, Feb 26, 2016 at 11:32 AM, Eren Sezener 
> wrote:
> > Hi,
> >
> > In PR #7346 we add a flip function that generalizes fliplr and flipud for
> > arbitrary axes.
> >
> > flipud and fliplr reverse the elements of an array along axis=0 and
> axis=1
> > respectively. The new flip function reverses the elements of an array
> along
> > any given axis. In case flip is called with axis=0 or axis=1, the
> function
> > is equivalent to flipud and fliplr respectively.
> >
> > A similar function is also available in MATLAB™.
> >
> > We use this function in PR #7347 to generalize the rot90 function to
> rotate
> > an arbitrary plane (defined by the axes argument) of a multidimensional
> > array. By that we fix issue #6506.
> >
> > Because flip function introduces a new API, @shoyer asked us to consult
> the
> > mailing list.
> >
> > Any objection to adding the generalized flip function?
> >
> > Best regards,
> > C. Eren Sezener & Denis Alevi
> >
> > ___
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] GSoC?

2016-02-16 Thread Stephan Hoyer
On Wed, Feb 10, 2016 at 4:22 PM, Chris Barker  wrote:

> We might consider adding "improve duck typing for numpy arrays"
>>
>
> care to elaborate on that one?
>
> I know it come up on here that it would be good to have some code in numpy
> itself that made it easier to make array-like objects (I.e. do indexing the
> same way) Is that what you mean?
>

I was thinking particularly of improving the compatibility of numpy
functions (e.g., concatenate) with non-numpy array-like objects, but now
that you mention it utilities to make it easier to make array-like objects
could also be a good thing.

In any case, I've now elaborated on my thought into a full project idea on
the Wiki:
https://github.com/scipy/scipy/wiki/GSoC-2016-project-ideas#improved-duck-typing-support-for-n-dimensional-arrays

Arguably, this might be too difficult for most GSoC students -- the API
design questions here are quite contentious. But given that "Pythonic
dtypes" is still up there as a GSoC proposal, it's in good company.

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Deprecating `numpy.iterable`

2016-02-11 Thread Stephan Hoyer
We certainly can (and probably should) deprecate this, but we can't remove
it for a very long time.

np.iterable is used in a lot of third party code.

On Wed, Feb 10, 2016 at 7:09 PM, Joseph Fox-Rabinovitz <
jfoxrabinov...@gmail.com> wrote:

> I have created a PR to deprecate `np.iterable`
> (https://github.com/numpy/numpy/pull/7202). It is a very old function,
> introduced as a utility in 2005
> (
> https://github.com/numpy/numpy/commit/052a7b2e3276a303be1083022fc24d43084d2e14
> ),
> and there is no good reason for it to be part of the public API. It is
> used internally 10 times within numpy. I have repaced those usages
> with a private function `np.lib.function_base._iterable` and added a
> `DeprecationWarning` to the public function.
>
> Is there anyone that objects to deprecating this function?
>
> Regards,
>
> -Joseph
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] GSoC?

2016-02-10 Thread Stephan Hoyer
On Wed, Feb 10, 2016 at 3:02 PM, Ralf Gommers 
wrote:

> OK first version:
> https://github.com/scipy/scipy/wiki/GSoC-2016-project-ideas
> I kept some of the ideas from last year, but removed all potential mentors
> as the same people may not be available this year - please re-add
> yourselves where needed.
>
> And to everyone who has a good idea, and preferably is willing to mentor
> for that idea: please add it to that page.
>
> Ralf
>

I removed the "Improve Numpy datetime functionality" project, since the
relevant improvements have mostly already made it into NumPy 1.11.

We might consider adding "improve duck typing for numpy arrays" if any GSOC
students are true masochists ;). I could potentially be a mentor for this
one, though of course Nathaniel is the obvious choice.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: xarray (formerly xray) v0.7.0 released

2016-01-21 Thread Stephan Hoyer
I am pleased to announce version v0.7.0 of xarray, the project formerly
known as xray.

xarray is an open source project and Python package that aims to bring the
labeled data power of pandas to the physical sciences, by providing
N-dimensional variants of the core pandas data structures. These data
structures are based on the data model of the netCDF file format.

In this latest release, we have renamed the project from "xray" to
"xarray". This avoids a namespace conflict with the entire field of X-ray
science. We have new URLs for our documentation, source code and mailing
list:
http://xarray.pydata.org/
http://github.com/pydata/xarray/
https://groups.google.com/forum/#!forum/xarray

Highlights of this release:

* An internal refactor of DataArray internals
* New methods for reshaping, rolling and shifting data
* Preliminary support for pandas.MultiIndex
* Support for reading GRIB, HDF4 and other file formats via PyNIO

For more details, read the full release notes:
http://xarray.pydata.org/en/stable/whats-new.html

Contributors to this release:

Antony Lee
Fabien Maussion
Joe Hamman
Maximilian Roos
Stephan Hoyer
Takeshi Kanmae
femtotrader

I'd also like to highlight the contributions of Clark Fitzgerald, who added
a plotting module to xray in v0.6, and Dave Brown, for his assistance
adding PyNIO support.

Best,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Software Capabilities of NumPy in Our Tensor Survey Paper

2016-01-15 Thread Stephan Hoyer
Robert beat me to it on einsum, but also check tensordot for general tensor 
contraction.
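
For instance, a small sketch of a general contraction written both ways
(shapes are arbitrary):

import numpy as np

A = np.random.rand(3, 4, 5)
B = np.random.rand(5, 4, 6)

# contract A's axes 1 and 2 against B's axes 1 and 0, respectively
C = np.tensordot(A, B, axes=([1, 2], [1, 0]))
print(C.shape)  # (3, 6)

# the same contraction in einsum notation
print(np.einsum('ijk,kjm->im', A, B).shape)  # (3, 6)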

On Fri, Jan 15, 2016 at 9:30 AM, Nathaniel Smith  wrote:

> On Jan 15, 2016 8:36 AM, "Li Jiajia"  wrote:
>>
>> Hi all,
>> I’m a PhD student in Georgia Tech. Recently, we’re working on a survey
> paper about tensor algorithms: basic tensor operations, tensor
> decomposition and some tensor applications. We are making a table to
> compare the capabilities of different software and planning to include
> NumPy. We’d like to make sure these parameters are correct to make a fair
> compare. Although we have looked into the related documents, please help us
> to confirm these. Besides, if you think there are more features of your
> software and a more preferred citation, please let us know. We’ll consider
> to update them. We want to show NumPy supports tensors, and we also include
> "scikit-tensor” in our survey, which is based on NumPy.
>> Please let me know any confusion or any advice!
>> Thanks a lot! :-)
>>
>> Notice:
>> 1. “YES/NO” to show whether or not the software supports the operation or
> has the feature.
>> 2. “?” means we’re not sure of the feature, and please help us out.
>> 3. “Tensor order” means the maximum number of tensor dimensions that
> users can do with this software.
>> 4. For computational cores,
>> 1) "Element-wise Tensor Operation (A * B)” includes element-wise
> add/minus/multiply/divide, also Kronecker, outer and Katri-Rao products. If
> the software contains one of them, we mark “YES”.
>> 2) “TTM” means tensor-times-matrix multiplication. We distinguish TTM
> from tensor contraction. If the software includes tensor contraction, it
> can also support TTM.
>> 3) For “MTTKRP”, we know most software can realize it through the above
> two operations. We mark it “YES” only if there is a specific optimization for the
> whole operation.
> NumPy has support for working with multidimensional tensors, if you like,
> but it doesn't really use the tensor language and notation (preferring
> instead to think in terms of "arrays" as a somewhat more computationally
> focused and less mathematically focused conceptual framework).
> Which is to say that I actually have no idea what all those jargon terms
> you're asking about mean :-) I am suspicious that NumPy supports more of
> those operations than you have marked, just under different names/notation,
> but really can't tell either way for sure without knowing what exactly they
> are.
> (It is definitely correct though that NumPy includes no support for sparse
> tensors, and NumPy itself is not multi-threaded beyond what we get for free
> through the BLAS, though there are external libraries that can perform
> multi-threaded computations on top of data stored in numpy arrays.)
> -n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 2:30 PM, Nathaniel Smith  wrote:

> The reason I didn't suggest dask is that I had the impression that
> dask's model is better suited to bulk/streaming computations with
> vectorized semantics ("do the same thing to lots of data" kinds of
> problems, basically), whereas it sounded like the OP's algorithm
> needed lots of one-off unpredictable random access.
>
> Obviously even if this is true then it's useful to point out both
> because the OP's problem might turn out to be a better fit for dask's
> model than they indicated -- the post is somewhat vague :-).
>
> But, I just wanted to check, is the above a good characterization of
> dask's strengths/applicability?
>

Yes, dask is definitely designed around setting up a large streaming
computation and then executing it all at once.

But it is pretty flexible in terms of what those specific computations are,
and can also work for non-vectorized computation (especially via dask
imperative). It's worth taking a look at dask's collections for a sense of
what it can do here. The recently refreshed docs provide a nice overview:
http://dask.pydata.org/

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant 
wrote:

> I don't know enough about xray to know whether it supports this kind of
> general labeling to be able to build your entire data-structure as an x-ray
> object.   Dask could definitely be used to process your data in an easy to
> describe manner (creating a dask.bag of dask.arrays would work though I'm
> not sure there are any methods that would buy you anything over just having a
> standard dictionary of dask.arrays).   You can definitely use dask
> imperative to parallelize your data-manipulation algorithms.
>

Indeed, xray's data model is not flexible enough to represent this sort of
data -- it's designed around cases where multiple arrays use shared axes.

However, I would indeed recommend dask.array (coupled with some sort of
on-disk storage) as a possible solution for this problem, if you need to be
able manipulate these arrays with an API that looks like NumPy. That said,
the fact that your data consists of ragged arrays suggests that the
dask.array API may be less useful for you.

Tools like dask.imperative, coupled with HDF5 for storage, could still be
very useful, though.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dynamic array list implementation

2015-12-23 Thread Stephan Hoyer
We have a type similar to this (a typed list) internally in pandas, although it 
is restricted to a single dimension and far from feature complete -- it only 
has .append and a .to_array() method for converting to a 1d numpy array. Our 
version is written in Cython, and we use it for performance reasons when we 
would otherwise need to create a list of unknown length:

https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99

In my experience, it's several times faster than using a builtin list from
Cython, which makes sense given that it needs to copy about 1/3 the data (no
type or reference count for individual elements). Obviously, it uses 1/3 the
space to store the data, too. We currently don't expose this object externally,
but it could be an interesting project to adapt this code into a standalone
project that could be more broadly useful.

Cheers,
Stephan

On Tue, Dec 22, 2015 at 8:20 PM, Chris Barker 
wrote:

> sorry for being so lazy as to not go look at the project pages, but
> This sounds like it could be really useful, and maybe supersede a couple of
> half-baked projects of mine. But -- what does "dynamic" mean?
> - can you append to these arrays?
> - can it support "ragged arrrays" -- it looks like it does.
>>
>> >>> L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
>> >>> print(L)
>> [[0], [1 2], [3 4 5], [6 7 8 9]]
>>
>> so this looks like a ragged array -- but what do you get when you do:
> for row in L:
> print row
>> >>> print(L.data)
>> [0 1 2 3 4 5 6 7 8 9]
>>
>> is .data a regular old 1-d numpy array?
>> >>> L = ArrayList( np.arange(10), [3,3,4])
>> >>> print(L)
>> [[0 1 2], [3 4 5], [6 7 8 9]]
>> >>> print(L.data)
>> [0 1 2 3 4 5 6 7 8 9]
>>
>> does an ArrayList act like a numpy array in other ways:
> L * 5
> L* some_array
> in which case, how does it do broadcasting???
> Thanks,
> -CHB
 L = ArrayList(["Hello", "world", "!"])
>> >>> print(L[0])
>> 'Hello'
>> >>> L[1] = "brave new world"
>> >>> print(L)
>> ['Hello', 'brave new world', '!']
>>
>>
>>
>> Nicolas
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
> -- 
> Christopher Barker, Ph.D.
> Oceanographer
> Emergency Response Division
> NOAA/NOS/OR&R   (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
> chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] array of random numbers fails to construct

2015-12-08 Thread Stephan Hoyer
On Sun, Dec 6, 2015 at 3:55 PM, Allan Haldane 
 wrote:

>
> I've also often wanted to generate large datasets of random uint8 and
> uint16. As a workaround, this is something I have used:
>
> np.ndarray(100, 'u1', np.random.bytes(100))
>
> It has also crossed my mind that np.random.randint and np.random.rand
> could use an extra 'dtype' keyword. It didn't look easy to implement though.
>

Another workaround that avoids creating a copy is to use the view method,
e.g.,
np.random.randint(np.iinfo(int).min, np.iinfo(int).max,
size=(1,)).view(np.uint8)  # creates 8 random bytes

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Proposal for a new function: np.moveaxis

2015-11-04 Thread Stephan Hoyer
I've put up a pull request implementing a new function, np.moveaxis, as an
alternative to np.transpose and np.rollaxis:
https://github.com/numpy/numpy/pull/6630

This functionality has been discussed (even the exact function name)
several times over the years, but it never made it into a pull request. The
most pressing issue is that the behavior of np.rollaxis is not intuitive to
most users:
https://mail.scipy.org/pipermail/numpy-discussion/2010-September/052882.html
https://github.com/numpy/numpy/issues/2039
http://stackoverflow.com/questions/29891583/reason-why-numpy-rollaxis-is-so-confusing

In this pull request, I also allow the source and destination axes to be
sequences as well as scalars. This does not add much complexity to the
code, solves some additional use cases and makes np.moveaxis a proper
generalization of the other axes manipulation routines (see the pull
requests for details).
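
For concreteness, a few usage sketches matching the semantics proposed in the
PR (resulting shapes in comments):

import numpy as np

x = np.zeros((3, 4, 5))

np.moveaxis(x, 0, -1).shape   # (4, 5, 3): move the first axis to the end
np.moveaxis(x, -1, 0).shape   # (5, 3, 4): move the last axis to the front

# source and destination may also be sequences:
np.moveaxis(x, [0, 1], [-1, -2]).shape  # (5, 4, 3)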

Best of all, it already works on ndarray duck types (like masked array and
dask.array), because they have already implemented transpose.

I think np.moveaxis would be a useful addition to NumPy -- I've found
myself writing helper functions with a subset of its functionality several
times over the past few years. What do you think?

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Nansum function behavior

2015-10-23 Thread Stephan Hoyer
Hi Charles,

You should read the previous discussion about this issue on GitHub:
https://github.com/numpy/numpy/issues/1721

For what it's worth, I do think the new definition of nansum is more
consistent.

If you want to preserve NaN if there are no non-NaN values, you can often
calculate this desired quantity from nanmean, which does return NaN if
there are only NaNs.
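
For example, one way to recover the old all-NaN behavior from the new
functions (a sketch, not an official recipe):

import numpy as np

x = np.array([np.nan, np.nan])

np.nansum(x)   # 0.0 under the new definition
np.nanmean(x)  # nan (with a RuntimeWarning), so the all-NaN case is detectable

# reconstruct "NaN if everything is NaN, otherwise the sum":
result = np.nan if np.isnan(x).all() else np.nansum(x)
print(result)  # nan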

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] when did column_stack become C-contiguous?

2015-10-18 Thread Stephan Hoyer
Looking at the git logs, column_stack appears to have been that way
(creating a new array with concatenate) since at least NumPy 0.9.2, way
back in January 2006:
https://github.com/numpy/numpy/blob/v0.9.2/numpy/lib/shape_base.py#L271

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Making datetime64 timezone naive

2015-10-13 Thread Stephan Hoyer
On Mon, Oct 12, 2015 at 12:38 AM, Nathaniel Smith  wrote:

>
> One possible strategy here would be to do some corpus analysis to find
> out whether anyone is actually using it, like I did for the ufunc ABI
> stuff:
>   https://github.com/njsmith/codetrawl
>   https://github.com/njsmith/ufunc-abi-analysis
>
> "datetime_to_string" is an easy token to search for, though it looks
> like enough people have their own functions named that that you'd have
> to do a bit of filtering to ignore non-numpy-related uses.


Yes, this is a good approach. I actually mistyped the name here -- it's
actually "datetime_as_string". A GitHub search does turn up a handful of
uses outside of NumPy:
https://github.com/search?utf8=%E2%9C%93=numpy.datetime_as_string+in%3Afile%2Cpath+NOT+numpy%2Fcore+NOT+test_datetime.py+NOT+arrayprint.py=Code=searchresults

That said, I'm not sure it's worth going to the trouble to ensure it
continues to work in the future. This function was entirely undocumented,
and doesn't even have an inspectable function signature.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Deprecating unitless timedelta64 and "safe" casting of integers to timedelta64

2015-10-13 Thread Stephan Hoyer
As part of the datetime64 cleanup I've been working on over the past few
days, I noticed that NumPy's casting rules for np.datetime64('NaT') were
not working properly:
https://github.com/numpy/numpy/pull/6465

This led to my discovery that NumPy currently supports unit-less timedeltas
(e.g., "np.timedelta64(5)"), which indicate some sort of generic time
difference. The current behavior is to take the time units from the other
argument when these are used in a binary operation.

Even worse, we currently support "safe" casting of integers to timedelta64,
which means that integer + datetime64 and integer + timedelta64 arithmetic
works:

In [4]: np.datetime64('2000-01-01T00') + 10
Out[4]: numpy.datetime64('2000-01-01T10:00-0800','h')

Based on the principle that NumPy's datetime support should mirror the
standard library as much as possible, both of these behaviors seem like a
bad idea. We have datetime types precisely to disambiguate these sorts of
situations.
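
For comparison, the standard library refuses the ambiguous operation outright;
a quick illustration:

import datetime

d = datetime.datetime(2000, 1, 1)
print(d + datetime.timedelta(hours=10))  # unambiguous: 2000-01-01 10:00:00

try:
    d + 10  # ambiguous: 10 of what?
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'datetime.datetime' and 'int'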

I'd like to propose deprecating such casting in NumPy 1.11, with the intent
of removing it entirely as soon as practical.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Making datetime64 timezone naive

2015-10-12 Thread Stephan Hoyer
As has come up repeatedly over the past few years, nobody seems to be very
happy with the way that NumPy's datetime64 type parses and prints datetimes
in local timezones.

The tentative consensus from last year's discussion was that we should make
datetime64 timezone naive, like the standard library's datetime.datetime:
http://thread.gmane.org/gmane.comp.python.numeric.general/57184

That makes sense to me, and it's exactly what I'd like to see happen for
NumPy 1.11. Here's my PR to make that happen:
https://github.com/numpy/numpy/pull/6453

As a temporary measure, we still will parse datetimes that include a
timezone specification by converting them to UTC, but will issue a
DeprecationWarning. This is important for a smooth transition, because at
the very least I suspect the "Z" modifier for UTC is widely used. Another
option would be to preserve this conversion indefinitely, without any
deprecation warning.

There's one (slightly) contentious API decision to make: What should we do
with the numpy.datetime_to_string function? As far as I can tell, it was
never documented as part of the NumPy API and has not been used very much
or at all outside of NumPy's own test suite, but it is exposed in the main
numpy namespace. If we can remove it, then we can delete and simplify a lot
more code related to timezone parsing and display. If not, we'll need to do
a bit of work so we can distinguish between the string representations of
timezone naive and UTC.

Best,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Make all comparisons with NaT false?

2015-10-11 Thread Stephan Hoyer
Currently, NaT (not a time) does not have any special treatment when used
in comparison with datetime64/timedelta64 objects.

This means that it's equal to itself, and treated as the smallest possible
value in comparisons, e.g., NaT == NaT and NaT < any_other_time.

To me, this seems a little crazy for a value meant to denote a
missing/invalid time -- NaT should really have the same comparison behavior
as NaN. That is, all comparisons with NaT should be false. The good news is
that updating this behavior turns out to be only a matter of adding a
single conditional to umath/loops.c.src -- most of the work would be fixing
tests.
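
To spell out the proposal, NaT would mirror how floating point NaN already
compares (the "proposed" lines are the change under discussion, not current
behavior):

import numpy as np

nan = float('nan')
print(nan == nan, nan < 0.0, nan > 0.0)  # False False False

nat = np.datetime64('NaT')
t = np.datetime64('2015-10-11')

# current behavior (NumPy <= 1.10): nat == nat is True, and nat < t is True
# proposed behavior: nat == nat, nat < t, and nat > t would all be False,
# exactly matching the NaN comparisons above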

Whether you call this an API change or a bug fix is somewhat of a judgment
call, but I believe this change is certainly consistent with the goals of
datetime64. It's also consistent with how NaT is used in pandas, which uses
its own wrappers around datetime64 precisely to fix these sorts of issues.

So I'm raising this here to get some opinions on the right path forward:
1. Is this a bug fix that we can backport to 1.10.x?
2. Is this an API change that should wait until 1.11?
3. Is this something where we need to start issuing warnings and deprecate
the existing behavior?

My vote would be for option 2. I think it's really a bug fix, but it would
break enough code that I wouldn't want to spring this on anybody in a bug
fix release. I'd rather not wait several releases on this one because that
will only exacerbate issues with being able to use datetime64 reliably.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Cython-based OpenMP-accelerated quartic polynomial solver

2015-10-06 Thread Stephan Hoyer
On Tue, Oct 6, 2015 at 1:14 AM, Daπid  wrote:

> One idea: what about creating a "parallel numpy"? There are a few
> algorithms that can benefit from parallelisation. This library would mimic
> Numpy's signature, and the user would be responsible for choosing the
> single threaded or the parallel one by just changing np.function(x, y) to
> pnp.function(x, y)
>

I would recommend taking a look at dask.array [1], which in many cases
works exactly like a parallel NumPy, though it also does lazy and
out-of-core computation. It's a new project, but it's remarkably mature --
we use it as an alternative array backend (to numpy) in xray, and it's also
being used by scikit-image.

[1] http://dask.pydata.org/en/latest/array.html


> If that were deemed a good one, what would be the best parallelisation
> scheme? OpenMP? Threads?
>

Dask uses threads. That works pretty well as long as all the hard work is
calling into something that releases the GIL (which includes NumPy, of
course).
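
As a rough sketch of what the "parallel NumPy" usage looks like with
dask.array (chunk sizes chosen arbitrarily):

import numpy as np
import dask.array as da

x = da.from_array(np.random.rand(8000, 8000), chunks=(1000, 1000))

# operations build a lazy task graph; nothing is computed yet
y = (x + x.T).mean(axis=0)

# compute() executes the graph, using a thread pool by default
result = y.compute()
print(result.shape)  # (8000,)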
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Sign of NaN

2015-09-29 Thread Stephan Hoyer
On Tue, Sep 29, 2015 at 8:13 AM, Charles R Harris  wrote:

> Due to a recent commit, Numpy master now raises an error when applying the
> sign function to an object array containing NaN. Other options may be
> preferable, returning NaN for instance, so I would like to open the topic
> for discussion on the list.
>

We discussed this last month on the list and on GitHub:
https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073503.html
https://github.com/numpy/numpy/issues/6265
https://github.com/numpy/numpy/pull/6269/files

The discussion was focused on what to do in the generic fallback case. Now
that I think about this more, I think it makes sense to explicitly check
for NaN in the unorderable case, and return NaN if the input is NaN. I
would not return NaN in general from unorderable objects, though -- in
general we should raise an error.

It sounds like Allan has already fixed this in his PR, but it also would
not be hard to add that logic to the existing code. Is this code in
NumPy 1.10?

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] interpretation of the draft governance document (was Re: Governance model request)

2015-09-23 Thread Stephan Hoyer
Travis -- have you included all your email addresses in your GitHub profile? 
When I type git shortlog -ne, I see 2063 commits from your Continuum address 
that seem to be missing from the contributors page on github. Generally 
speaking, the git logs tend to be more reliable for these counts.

On Wed, Sep 23, 2015 at 3:12 PM, Travis Oliphant 
wrote:

>>
>>
>> Here is a list of the current Contributors to the main NumPy repository:
>>
>> https://github.com/numpy/numpy/graphs/contributors
>>
>>
> One of the problems with this list is that my contributions to the project
> are extremely under-represented because the large majority of my commitment
> of code happened in 2005 to 2006 before github was used.  So, using
> this as a list of the contributors is quite misleading --- and there are a
> lot of people now looking only at lists like this one and it might confuse
> them why I care so much.So, if you are going to make this list public
> in a governance document like this, then I think some acknowledgement of
> the source of the original code and the contributors to that needs to also
> be made --- or you could just also point to the THANKS document which lists
> people up to about 2008.   Between 2008 and 2010 we will lose
> contributions, still and this can be acknowledged.
>> Consensus-based decision making by the community
>> 
>>
>> Normally, all project decisions will be made by consensus of all
>> interested Contributors. The primary goal of this approach is to ensure
>> that the people who are most affected by and involved in any given change
>> can contribute their knowledge in the confidence that their voices will be
>> heard, because thoughtful review from a broad community is the best
>> mechanism we know of for creating high-quality software.
>>
>> The mechanism we use to accomplish this goal may be unfamiliar for those
>> who are not experienced with the cultural norms around free/open-source
>> software development. We provide a summary here, and highly recommend that
>> all Contributors additionally read [Chapter 4: Social and Political
>> Infrastructure](
>> http://producingoss.com/en/producingoss.html#social-infrastructure) of
>> Karl Fogel's classic *Producing Open Source Software*, and in particular
>> the section on [Consensus-based Democracy](
>> http://producingoss.com/en/producingoss.html#consensus-democracy), for a
>> more detailed discussion.
>>
>> In this context, consensus does *not* require:
>>
>> - that we wait to solicit everybody's opinion on every change,
>> - that we ever hold a vote on anything,
>> - or that everybody is happy or agrees with every decision.
>>
>> For us, what consensus means is that we entrust *everyone* with the right
>> to veto any change if they feel it necessary. While this may sound like a
>> recipe for obstruction and pain, this is not what happens. Instead, we find
>> that most people take this responsibility seriously, and only invoke their
>> veto when they judge that a serious problem is being ignored, and that
>> their veto is necessary to protect the project. And in practice, it turns
>> out that such vetoes are almost never formally invoked, because their mere
>> possibility ensures that Contributors are motivated from the start to find
>> some solution that everyone can live with -- thus accomplishing our goal of
>> ensuring that all interested perspectives are taken into account.
>>
>> How do we know when consensus has been achieved? In principle, this is
>> rather difficult, since consensus is defined by the absence of vetos, which
>> requires us to somehow prove a negative. In practice, we use a combination
>> of our best judgement (e.g., a simple and uncontroversial bug fix posted on
>> GitHub and reviewed by a core developer is probably fine) and best efforts
>> (e.g., all substantive API changes must be posted to the mailing list in
>> order to give the broader community a chance to catch any problems and
>> suggest improvements; we assume that anyone who cares enough about NumPy to
>> invoke their veto right should be on the mailing list). If no-one bothers
>> to comment on the mailing list after a few days, then it's probably fine.
>> And worst case, if a change is more controversial than expected, or a
>> crucial critique is delayed because someone was on vacation, then it's no
>> big deal: we apologize for misjudging the situation, [back up, and sort
>> things out](
>> http://producingoss.com/en/producingoss.html#version-control-relaxation).
>>
>> If one does need to invoke a formal veto, then it should consist of:
>>
>> - an unambiguous statement that a veto is being invoked,
>> - an explanation of why it is being invoked, and
>> - a description of what conditions (if any) would convince the vetoer to
>> withdraw their veto.
>>
>> If all proposals for resolving some issue are vetoed, then the 

Re: [Numpy-discussion] Governance model request

2015-09-22 Thread Stephan Hoyer
On Tue, Sep 22, 2015 at 2:33 AM, Travis Oliphant 
wrote:

> The FUD I'm talking about is the anti-company FUD that has influenced
> discussions in the past.I really hope that we can move past this.
>

I have mostly stayed out of the governance discussion, in deference to how
new I am in this community, but I do want to take a moment to speak up here
to echo Travis's concern about anti-company FUD.

Everyone invested in NumPy has their own projects, priorities and employers
which shape their agenda. As far as I can tell, Travis and Continuum have
only ever acted with the long term health of the scipy ecosystem in mind.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

2015-09-03 Thread Stephan Hoyer
>From my perspective, a major advantage to dtypes is composability. For
example, it's hard to write a library like dask.array (out of core arrays)
that can support holding any conceivable ndarray subclass (like MaskedArray
or quantity), but handling arbitrary dtypes is quite straightforward -- and
that dtype information can be directly passed on, without the container
library knowing anything about the library that implements the dtype.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] np.sign and object comparisons

2015-08-31 Thread Stephan Hoyer
On Mon, Aug 31, 2015 at 1:23 AM, Sebastian Berg 
wrote:

> That would be my gut feeling as well. Returning `NaN` could also make
> sense, but I guess we run into problems since we do not know the input
> type. So `None` seems like the only option here I can think of right
> now.
>

My inclination is that return NaN would be the appropriate choice. It's
certainly consistent with the behavior for float dtypes -- my expectation
for object dtype behavior is that it works exactly like applying the
np.sign ufunc to each element of the array individually.

On the other hand, I suppose there are other ways in which an object can
fail all those comparisons (e.g., NaT?), so I suppose we could return None.
But it would still be a weird outcome for the most common case. Ideally, I
suppose, np.sign would return an array with int-NA dtype, but that's a
whole different can of worms...

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy helper function for __getitem__?

2015-08-26 Thread Stephan Hoyer
Indeed, the helper function I wrote for xray was not designed to handle
None/np.newaxis or non-1d Boolean indexers, because those are not valid
indexers for xray objects. I think it could be straightforwardly extended
to handle None simply by not counting them towards the total number of
dimensions.

On Tue, Aug 25, 2015 at 8:41 AM, Fabien fabien.mauss...@gmail.com wrote:

 I think that Stephan's function for xray is very useful. A possible
 improvement (probably at a certain performance cost) would be to be able
 to provide a shape instead of a number of dimensions. The output would
 then be slices with valid start and ends.

 Current behavior:
 In[9]: expanded_indexer(slice(None), 2)
 Out[9]: (slice(None, None, None), slice(None, None, None))

 With shape:
 In[9]: expanded_indexer(slice(None), (3, 4))
 Out[9]: (slice(0, 4, 1), slice(0, 5, 1))

 But if nobody needed something like this before me, I think that I might
 have a design problem in my code (still quite new to python).


Glad you found it helpful!

Python's slice object has the indices method which implements this logic,
e.g.,

In [15]: s = slice(None, 10)

In [16]: s.indices(100)
Out[16]: (0, 10, 1)

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy helper function for __getitem__?

2015-08-23 Thread Stephan Hoyer
I don't think NumPy has a function like this (at least, not exposed to Python), 
but I wrote one for xray, expanded_indexer, that you are welcome to borrow:
https://github.com/xray/xray/blob/v0.6.0/xray/core/indexing.py#L10

Stephan

On Sunday, Aug 23, 2015 at 7:54 PM, Fabien fabien.mauss...@gmail.com wrote:
Folks,

My search engine was not able to help me on this one, possibly because I
don't know exactly *what* I am looking for.

I need to override __getitem__ for a class that wraps a numpy array. I
know the dimensions of my array (which can be variable from instance to
instance), and I know what I want to do: for one preselected dimension,
I need to select another slice than requested by the user, do something
with the data, and return the variable.

I am looking for a function that helps me to clean the input of
__getitem__. There are so many possible cases, when the user uses [:] or
[..., 1:2] or [0, ..., :] and so forth. But all these cases have an
equivalent index array of len(ndimensions) with only valid slice()
objects in it. This array would be much easier for me to work with.

in pseudo code:

def __getitem__(self, item):
    # clean input
    item = np.clean_item(item, ndimensions=4)
    # Ok now item is guaranteed to be of len 4
    item[2] = slice()
    # Continue
    etc.

Is there such a function in numpy?

I hope I have been clear enough... Thanks a lot!

Fabien

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Flag for np.tile to use as_strided to reduce memory

2015-06-19 Thread Stephan Hoyer
On Fri, Jun 19, 2015 at 10:39 AM, Sebastian Berg sebast...@sipsolutions.net
 wrote:

 No, what tile does cannot be represented that way. If it was possible
 you can achieve the same using `np.broadcast_to` basically, which was
 just added though. There are some other things you can do, like rolling
 window (adding dimensions), maybe some day we should add that (or you
 want to take a shot ;)).

 - Sebastian


The one case where np.tile could be done using stride tricks is if the
dimension you want to repeat has size 1 or currently does not exist.
np.broadcast_to was an attempt to make this stuff less awkward, though it
still requires mixing in transposes.
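
A small sketch of that one case, repeating along a new leading dimension as a
zero-stride view rather than a copy (np.shares_memory is used only for
illustration and requires a newer NumPy):

import numpy as np

x = np.arange(3)                  # shape (3,)
y = np.broadcast_to(x, (4, 3))    # like np.tile(x, (4, 1)), but a view

print(y.strides)                  # (0, 8) on a typical 64-bit build
print(np.shares_memory(x, y))     # True -- no data was copied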
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: xray v0.5

2015-06-11 Thread Stephan Hoyer
I'm pleased to announce version 0.5 of xray, N-D labeled arrays and
datasets in Python.

xray is an open source project and Python package that aims to bring the
labeled data power of pandas to the physical sciences, by providing
N-dimensional variants of the core pandas data structures. These data
structures are based on the data model of the netCDF file format.

Highlights of this release:

* Support for parallel computation on arrays that don't fit in memory using
dask.array (see http://continuum.io/blog/xray-dask for more details)
* Support for multi-file datasets
* assign and fillna methods, based on the pandas methods of the same name.
* to_array and to_dataset methods for easier conversion between xray
Dataset and DataArray objects.
* Label based indexing with nearest neighbor lookups

For more details, read the full release notes:
http://xray.readthedocs.org/en/stable/whats-new.html

Best,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] matmul needs some clarification.

2015-06-03 Thread Stephan Hoyer
On Sat, May 30, 2015 at 3:23 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:

 The problem arises when multiplying a stack of matrices times a vector.
 PEP465 defines this as appending a '1' to the dimensions of the vector and
 doing the defined stacked matrix multiply, then removing the last dimension
 from the result. Note that in the middle step we have a stack of matrices
 and after removing the last dimension we will still have a stack of
 matrices. What we want is a stack of vectors, but we can't have those with
 our conventions. This makes the result somewhat unexpected. How should we
 resolve this?


I'm afraid I don't quite understand the issue. Maybe a more specific
example of the shapes you have in mind would help? Here's my attempt.

Suppose we have two arrays:
a with shape (i, j, k)
b with shape (k,)

Following the logic you describe from PEP 465, for a @ b the shapes
transform like so:
(i, j, k) @ (k, 1) -> (i, j, 1) -> (i, j)

This makes sense to me as a stack of vectors, as long as you are imagining
the original stack of matrices as along the first dimension. Which I'll
note is the default behavior for the new np.stack (
https://github.com/numpy/numpy/pull/5605).
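
In code, under that reading (a sketch using the Python 3.5 @ operator):

import numpy as np

a = np.ones((2, 3, 4))  # a stack of two 3x4 matrices
b = np.ones(4)          # a single vector

# PEP 465: b is treated as (4, 1), multiplied, then the last axis is removed
result = a @ b
print(result.shape)     # (2, 3) -- a stack of two length-3 vectors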
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed deprecations for 1.10: dot corner cases

2015-05-11 Thread Stephan Hoyer
On Sat, May 9, 2015 at 1:26 PM, Nathaniel Smith n...@pobox.com wrote:

 I'd like to suggest that we go ahead and add deprecation warnings to
 the following operations. This doesn't commit us to changing anything
 on any particular time scale, but it gives us more options later.


These both get a strong +1 from me.

How long has the outer product behavior for np.dot been around?
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposed deprecations for 1.10: dot corner cases

2015-05-11 Thread Stephan Hoyer
On Mon, May 11, 2015 at 2:53 PM, Alan G Isaac alan.is...@gmail.com wrote:

 I agree that where `@` and `dot` differ in behavior, this should be
 clearly documented.
 I would hope that the behavior of `dot` would not change.


Even if np.dot never changes (and indeed, perhaps it should not), issuing
these warnings seems like a good idea to me, once we have @ implemented
with the new behavior (and the @ operator backported from Python 3.5 as a
numpy function).

I expect that this warning would serve the useful purpose of reminding
users writing code intended to be used on earlier versions of numpy/python
that @ and np.dot don't work exactly the same way. As Nathaniel already
mentioned, it is quite straightforward to implement the outer product
behavior using the new @ behavior, so it will not be much of a hassle to
update code to remove the warning.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bug in np.nonzero / Should index returning functions return ndarray subclasses?

2015-05-09 Thread Stephan Hoyer
With regards to np.where -- shouldn't where be a ufunc, so subclasses or other
array-likes can control its behavior with __numpy_ufunc__?


As for the other indexing functions, I don't have a strong opinion about how
they should handle subclasses. But it is certainly tricky to attempt to
handle arbitrary subclasses. I would agree that the least error prone thing
to do is usually to return base ndarrays. Better to force subclasses to
override methods explicitly.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Advanced indexing: fancy vs. orthogonal

2015-04-03 Thread Stephan Hoyer
On Fri, Apr 3, 2015 at 10:59 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 I have an all-Python implementation of an OrthogonalIndexer class, loosely
 based on Stephan's code plus some axis remapping, that provides all the
 needed functionality for getting and setting with orthogonal indices.


Awesome, thanks!


 Would those interested rather see it as a gist to play around with, or as
 a PR adding an orthogonally indexable `.ix_` argument to ndarray?


My preference would be for a PR (even if it's purely a prototype) because
it supports inline comments better than a gist.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Advanced indexing: fancy vs. orthogonal

2015-04-03 Thread Stephan Hoyer
On Fri, Apr 3, 2015 at 4:54 PM, Nathaniel Smith n...@pobox.com wrote:

 Unfortunately, AFAICT this means our only options here are to have
 some kind of backcompat break in numpy, some kind of backcompat break
 in pandas, or to do nothing and continue indefinitely with the status
 quo where the same indexing operation might silently return different
 results depending on the types passed in.


For what it's worth, DataFrame.__getitem__ is also pretty broken in pandas
(even worse than in NumPy). Not even the pandas devs can keep straight how
it works!
https://github.com/pydata/pandas/issues/9595

So we'll probably need a backwards incompatible switch there at some point,
too.

That said, the issues are somewhat different, and in my experience the
strict label and integer based indexers .loc and .iloc work pretty well. I
haven't heard any complaints about how they do cartesian indexing rather
than fancy indexing.

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Advanced indexing: fancy vs. orthogonal

2015-04-02 Thread Stephan Hoyer
On Wed, Apr 1, 2015 at 7:06 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 Is there any other package implementing non-orthogonal indexing aside from
 numpy?


I think we can safely say that NumPy's implementation of broadcasting
indexing is unique :).

The issue is that many other packages rely on numpy for implementation of
custom array objects (e.g., scipy.sparse and scipy.io.netcdf). It's not
immediately obvious what sort of indexing these objects represent.

 If the functionality is lacking, e.g. use of slices in `np.ix_`, I'm all
 for improving that to provide the full functionality of orthogonal
 indexing. I just need a little more convincing that those new
 attributes/indexers are going to ever see any real use.


Orthogonal indexing is close to the norm for packages that implement
labeled data structures, both because it's easier to understand and
implement, and because it's difficult to maintain associations with labels
through complex broadcasting indexing.

Unfortunately, the lack of a full featured implementation of orthogonal
indexing has led to that wheel being reinvented at least three times (in
Iris, xray [1] and pandas). So it would be nice to have a canonical
implementation that supports slices and integers in numpy for that reason
alone. This could be done by building on the existing `np.ix_` function,
but a new indexer seems more elegant: there's just much less noise with
`arr.ix_[:1, 2, [3]]` than `arr[np.ix_(slice(1), 2, [3])]`.
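
To make the distinction concrete (a quick sketch; note that np.ix_ cannot
accept the slice directly, which is exactly the gap described above):

import numpy as np

x = np.arange(12).reshape(3, 4)

# broadcasting ("fancy") indexing pairs up coordinates -> 1-d result
x[[0, 2], [1, 3]]          # array([ 1, 11])

# orthogonal indexing takes the cross product -> 2-d result
x[np.ix_([0, 2], [1, 3])]  # array([[ 1,  3],
                           #        [ 9, 11]])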

It's also well known that indexing with __getitem__ can be much slower than
np.take. It seems plausible to me that a careful implementation of
orthogonal indexing could close or eliminate this speed gap, because the
model for orthogonal indexing is so much simpler than that for broadcasting
indexing: each element of the key tuple can be applied separately along the
corresponding axis.

So I think there could be a real benefit to having the feature in numpy. In
particular, if somebody is up for implementing it in C or Cython, I would
be very pleased.

 Cheers,
Stephan

[1] Here is my implementation of remapping from orthogonal to broadcasting
indexing. It works, but it's a real mess, especially because I try to
optimize by minimizing the number of times slices are converted into arrays:
https://github.com/xray/xray/blob/0d164d848401209971ded33aea2880c1fdc892cb/xray/core/indexing.py#L68
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Advanced indexing: fancy vs. orthogonal

2015-04-02 Thread Stephan Hoyer
On Thu, Apr 2, 2015 at 11:03 AM, Eric Firing efir...@hawaii.edu wrote:

 Fancy indexing is a horrible design mistake--a case of cleverness run
 amok.  As you can read in the Numpy documentation, it is hard to
 explain, hard to understand, hard to remember.


Well put!

I also failed to correctly predict your example.


 So I think you should turn the question around and ask, What is the
 actual real-world use case for fancy indexing?  How often does real
 code rely on it?


I'll just note that indexing with a boolean array with the same shape as
the array (e.g., x[x > 0] when x has greater than 1 dimension) technically
falls outside a strict interpretation of orthogonal indexing. But there's
not any ambiguity in adding that as an extension to orthogonal indexing
(which otherwise does not allow ndim > 1), so I think your point still
stands.
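
For example (a minimal sketch):

import numpy as np

x = np.array([[1, -2], [-3, 4]])
x[x > 0]  # array([1, 4]) -- always 1-d, even though x is 2-d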

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Improve Numpy Datetime Functionality for Gsoc

2015-03-24 Thread Stephan Hoyer
The most recent discussion about datetime64 was back in March and April of
last year:
http://mail.scipy.org/pipermail/numpy-discussion/2014-March/thread.html#69554
http://mail.scipy.org/pipermail/numpy-discussion/2014-April/thread.html#69774

In addition to unfortunate timezone handling, datetime64 has a lot of bugs
-- so many that I don't bother reporting them. But if anyone ever plans on
working on them, I can certainly help to assemble a long list of the issues
(many of these are mentioned in the above threads).

Unfortunately, though I would love to see datetime64 fixed, I'm not really
a suitable mentor for this role (I don't know C),
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] GSoC students: please read

2015-03-23 Thread Stephan Hoyer
On Mon, Mar 23, 2015 at 2:21 PM, Ralf Gommers ralf.gomm...@gmail.com
wrote:

 It's great to see that this year there are a lot of students interested in
 doing a GSoC project with Numpy or Scipy. So far five proposals have been
 submitted, and it looks like several more are being prepared now.


Hi Ralf,

Is there a centralized place for non-mentors to view proposals and give
feedback?

Thanks,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] numpy.stack -- which function, if any, deserves the name?

2015-03-16 Thread Stephan Hoyer
On Mon, Mar 16, 2015 at 1:50 AM, Stefan Otte stefan.o...@gmail.com wrote:

 Summarizing, my proposal is mostly concerned how to create block
 arrays from given arrays. I don't care about the name stack. I just
 used stack because it replaced hstack/vstack for me. Maybe bstack
 for block stack, or barray for block array?


Stefan -- thanks for sharing your perspective!

In conclusion, it sounds like we could safely use stack for my PR
(proposal 2), and use another name (perhaps block, barray or
block_array) for your proposal. I'm also not opposed to using a new verb
for my PR (the stacking alternative to concatenate), but I haven't come
up with any more descriptive alternatives.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] numpy.stack -- which function, if any, deserves the name?

2015-03-15 Thread Stephan Hoyer
In the past months there have been two proposals for new numpy functions
using the name stack:

1. np.stack for stacking like np.asarray(np.bmat(...))
http://thread.gmane.org/gmane.comp.python.numeric.general/58748/
https://github.com/numpy/numpy/pull/5057

2. np.stack for stacking along an arbitrary new axis (this was my proposal)
http://thread.gmane.org/gmane.comp.python.numeric.general/59850/
https://github.com/numpy/numpy/pull/5605

Both functions generalize the notion of stacking arrays from the existing
hstack, vstack and dstack, but in two very different ways. Both could be
useful -- but we can only call one stack. Which one deserves that name?

The existing *stack functions use the word "stack" to refer to combining
arrays in two subtly different ways:
a. For ND -> ND: stacking along an existing dimension (like
numpy.concatenate and proposal 1)
b. For ND -> (N+1)D: stacking along a new dimension (like proposal 2).

I think it would be much cleaner API design if we had different words to
denote these two different operations. "Concatenate" for combining along an
existing dimension already exists, so my thought (when I wrote proposal
2) was that the verb "stack" could be reserved (going forward) for
combining along a new dimension. This also has the advantage of suggesting
that concatenate and stack are the two fundamental operations for
combining N-dimensional arrays. The documentation on this is currently
quite confusing, mostly because no function like that in proposal 2
currently exists.

Of course, the *stack functions have existed for quite some time, and in
many cases vstack and hstack are indeed used for concatenate like
functionality (e.g., whenever they are used for 2D arrays/matrices). So the
case is not entirely clear-cut. (We'll never be able to remove this
functionality from NumPy.)

In any case, I would appreciate your thoughts.

Best,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Custom __array_interface__ error

2015-03-13 Thread Stephan Hoyer
In my experience writing ndarray-like objects, you likely want to implement
__array__ instead of __array_interface__. The former gives you full control
to create the ndarray yourself.
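
A minimal sketch of that approach (hypothetical class names):

import numpy as np

class CoreTensor(object):
    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self):
        # np.asarray / np.array call this to perform the conversion
        return self._data

class DiskTensor(object):
    def __array__(self):
        # full control: raise a readable error instead of converting
        raise TypeError('only Core tensor types support array conversion')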

On Fri, Mar 13, 2015 at 7:22 AM, Daniel Smith dgasm...@icloud.com wrote:

 Greetings everyone,
 I have a new project that deals with core and disk tensors wrapped into a
 single object so that the expressions are transparent to the user after the
 tensor is formed. I would like to add __array_interface__ to the core
 tensor and provide a reasonable error message if someone tries to call the
 __array_interface__ for a disk tensor. I may be missing something, but I do
 not see an obvious way to do this in the python layer.

 Currently I do something like:

 if ttype == "Core":
     self.__array_interface__ = self.tensor.ndarray_interface()
 else:
     self.__array_interface__ = {'typestr': 'Only Core tensor '
                                 'types are supported.'}

 Which provides at least a readable error message if it is not a core
 tensor:
 TypeError: data type Only Core tensor types are supported. not understood

 An easy solution I see is to change the numpy C-side __array_interface__
 error message to throw custom strings.

 In numpy/core/src/multiarray/ctors.c:2100 we have the __array_interface__
 conversion:

 if (!PyDict_Check(iface)) {
     Py_DECREF(iface);
     PyErr_SetString(PyExc_ValueError,
                     "Invalid __array_interface__ value, must be a dict");
     return NULL;
 }

 It could simply be changed to:

 if (!PyDict_Check(iface)) {
     if (PyString_Check(iface)) {
         /* PyErr_SetString takes a C string, so convert the object first */
         PyErr_SetString(PyExc_ValueError, PyString_AsString(iface));
     }
     else {
         PyErr_SetString(PyExc_ValueError,
                         "Invalid __array_interface__ value, must be a dict");
     }
     Py_DECREF(iface);
     return NULL;
 }

 Thoughts?

 Cheers,
 -Daniel Smith
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] ANN: xray v0.4 released

2015-03-03 Thread Stephan Hoyer
I'm pleased to announce a major release of xray, v0.4.

xray is an open source project and Python package that aims to bring the
labeled data power of pandas to the physical sciences, by providing
N-dimensional variants of the core pandas data structures.

Our goal is to provide a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays, rather than the tabular data for
which pandas excels. Our approach adopts the Common Data Model for
self-describing scientific data in widespread use in the Earth sciences:
xray.Dataset is an in-memory representation of a netCDF file.

Documentation: http://xray.readthedocs.org/
GitHub: https://github.com/xray/xray

Highlights of this release:

* Automatic alignment of index labels in arithmetic and when combining
arrays or datasets.
* Aggregations like mean now skip missing values by default.
* Relaxed equality rules in concat and merge for variables with equal
value(s) but different shapes.
* New drop method for dropping variables or index labels.
* Support for reindexing with a fill method like pandas.

For more details, read the full release notes:
http://xray.readthedocs.org/en/stable/whats-new.html

Best,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [SciPy-User] Congratulations to Chris Barker...

2015-03-02 Thread Stephan Hoyer
Indeed, congratulations Chris!

Are there plans to write a vectorized version for NumPy? :)

On Mon, Mar 2, 2015 at 2:28 PM, Nathaniel Smith n...@pobox.com wrote:

 ...on the acceptance of his PEP! PEP 485 adds a math.isclose function
 to the standard library, encouraging people to do numerically more
 reasonable floating point comparisons.

 The PEP:
   https://www.python.org/dev/peps/pep-0485/

 The pronouncement:
   http://thread.gmane.org/gmane.comp.python.devel/151776/focus=151778

 -n

 --
 Nathaniel J. Smith -- http://vorpus.org
 ___
 SciPy-User mailing list
 scipy-u...@scipy.org
 http://mail.scipy.org/mailman/listinfo/scipy-user

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Objects exposing the array interface

2015-02-25 Thread Stephan Hoyer
On Wed, Feb 25, 2015 at 1:24 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 1. When converting these objects to arrays using PyArray_Converter, if
 the arrays returned by any of the array interfaces is not C contiguous,
 aligned, and writeable, a copy that is will be made. Proper arrays and
 subclasses are passed unchanged. This is the source of the error reported
 above.

I'm not entirely sure I understand this -- how is PyArray_Converter used in
numpy? For example, if I pass a non-contiguous array to your class Foo,
np.asarray does not do a copy:

In [25]: orig = np.zeros((3, 4))[:2, :3]

In [26]: orig.flags
Out[26]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [27]: subclass = Foo(orig)

In [28]: np.asarray(subclass)
Out[28]:
array([[ 0.,  0.,  0.],
   [ 0.,  0.,  0.]])

In [29]: np.asarray(subclass)[:] = 1

In [30]: np.asarray(subclass)
Out[30]:
array([[ 1.,  1.,  1.],
   [ 1.,  1.,  1.]])


But yes, this is probably a bug.

2. When converting these objects using PyArray_OutputConverter, as well as
 in similar code in the ufunc machinery, anything other than a proper array
 or subclass raises an error. This means that, contrary to what the docs on
 subclassing say, see below, you cannot use an object exposing the array
 interface as an output parameter to a ufunc


Here it might be a good idea to distinguish between objects that define
__array__ vs __array_interface__/__array_struct__. A class that defines
__array__ might not be very ndarray-like at all, but rather be something
that can be *converted* to an ndarray. For example, objects in pandas
define __array__, but updating the return value of df.__array__() in-place
will not necessarily update the DataFrame (e.g., if the frame had
inhomogeneous dtypes).
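
A quick illustration of that last point (a sketch; exact behavior may vary
across pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # inhomogeneous dtypes
arr = np.asarray(df)   # calls df.__array__(), which must return a copy here
arr[0, 0] = 100
print(df.iloc[0, 0])   # still 1 -- mutating arr did not touch the frame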
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Objects exposing the array interface

2015-02-25 Thread Stephan Hoyer
On Wed, Feb 25, 2015 at 2:48 PM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 I am not really sure what the behavior of __array__ should be. The link
 to the subclassing docs I gave before indicates that it should be possible
 to write to it if it is writeable (and probably pandas should set the
 writeable flag to False if it cannot be reliably written to), but the
 obscure comment I mentioned seems to point to the opposite, that it should
 never be written to. This is probably a good moment in time to figure out
 what the proper behavior should be and document it.


It's one thing to rely on the result of __array__ being writeable. It's
another thing to rely on writing to that array to modify the original
array-like object.

Presuming the latter would be a mistake. Let me give three categories of
examples where I know this would fail:
- pandas: for DataFrame objects with inhomogeneous dtype
- netCDF4 and other IO libraries: The array's data may be readonly on disk
or require a network call to access. The memory model may not even be able
to be cleanly mapped to numpy's (e.g., it may use chunked storage)
- blaze.Data: Blaze arrays use lazily evaluation and don't support mutation

As far as I know, none of these libraries produce readonly ndarray objects
from __array__. It can actually be highly convenient to return normal,
writeable ndarrays even if they don't modify the original source, because
this lets you do all the normal numpy stuff to the returned array,
including operations that mutate it.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] converting a list of tuples into an array of tuples?

2015-02-09 Thread Stephan Hoyer
It appears that the only reliable way to do this may be to use a loop to
modify an object arrays in-place. Pandas has a version of this written in
Cython:
https://github.com/pydata/pandas/blob/c1a0dbc4c0dd79d77b2a34be5bc35493279013ab/pandas/lib.pyx#L342

To quote Wes McKinney: "Seriously can't believe I had to write this function"
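
In pure NumPy the same idea looks like this (a sketch):

import numpy as np

def tuples_to_object_array(tuples):
    # np.array(tuples, dtype=object) would still build a 2-d array,
    # so fill a 1-d object array element by element instead
    out = np.empty(len(tuples), dtype=object)
    for i, t in enumerate(tuples):
        out[i] = t
    return out

a = tuples_to_object_array([(1, 2), (3, 4), (5, 6)])
print(a.shape)  # (3,) -- each element is a tuple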

Best,
Stephan

On Mon, Feb 9, 2015 at 8:31 AM, Benjamin Root ben.r...@ou.edu wrote:

 I am trying to write up some code that takes advantage of np.tile() on
 arbitrary array-like objects. I only want to tile along the first axis. Any
 other axis, if they exist, should be left alone. I first coerce the object
 using np.asanyarray(), tile it, and then coerce it back to the original
 type.

 The problem seems to be that some of my array-like objects are being
 over-coerced, particularly the list of tuples. I tried doing
 np.asanyarray(a, dtype='O'), but that still turns it into a 2-D array.

 Am I missing something?

 Thanks,
 Ben Root

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] New function: np.stack?

2015-02-05 Thread Stephan Hoyer
There are two usual ways to combine a sequence of arrays into a new array:
1. concatenated along an existing axis
2. stacked along a new axis

For 1, we have np.concatenate. For 2, we have np.vstack, np.hstack,
np.dstack and np.column_stack. For arrays with arbitrary dimensions, there
is the np.array constructor, possibly with transpose to get the result in
the correct order. (I've used this last option in the past but haven't been
especially happy with it -- it takes some trial and error to get the axis
swapping or transpose right for higher dimensional input.)

These methods are similar but subtly distinct, and none of them generalize
well to n-dimensional input. It seems like the function we are missing is
the plain np.stack, which takes the axis to stack along as a keyword
argument. The exact desired functionality is clearest to understand by
example:

>>> X = [np.random.randn(100, 200) for i in range(10)]
>>> stack(X, axis=0).shape
(10, 100, 200)
>>> stack(X, axis=1).shape
(100, 10, 200)
>>> stack(X, axis=2).shape
(100, 200, 10)

So I'd like to propose this new function for numpy. The desired signature
would be simply np.stack(arrays, axis=0). Ideally, the confusing mess of
other stacking functions could then be deprecated, though we could probably
never remove them.
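
A minimal implementation in terms of existing primitives (a sketch; a real
version should validate shapes and give good error messages):

import numpy as np

def stack(arrays, axis=0):
    # give each array a new length-1 axis, then concatenate along it
    expanded = [np.expand_dims(a, axis) for a in arrays]
    return np.concatenate(expanded, axis=axis)

X = [np.random.randn(100, 200) for i in range(10)]
assert stack(X, axis=1).shape == (100, 10, 200)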

Matthew Rocklin recently wrote an out of core version of this for his dask
project (part of Blaze), which is what got me thinking about the need for
this functionality:
https://github.com/ContinuumIO/dask/pull/30

Cheers,
Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Datetime again

2015-01-28 Thread Stephan Hoyer
On Wed, Jan 28, 2015 at 5:13 PM, Chris Barker chris.bar...@noaa.gov wrote:

 I tend to agree with Nathaniel that a ndarray subclass is less than ideal
 -- they tend to get ugly fast. But maybe that is the only way to do
 anything in Python, short of a major refactor to be able to write a dtype
 in Python -- which would be great, but sure sounds like a major project to
 me.


My vote would be for using composition rather than inheritance. So
DatetimeArray should contain but not be an ndarray, making use of
appropriate APIs like __array__, __array_wrap__ and __numpy_ufunc__.

 And as for "The 64 bits of long long really isn't enough and leads to all
 sorts of compromises." -- not long enough for what? I've always thought that
 what we need is the ability to set the epoch. Does anyone ever need
 picoseconds since 100 years ago? And if they did, we'd be in a heck of a
 mess with leap seconds and all that anyway.


I agree pretty strongly with the Blaze docs with respect to time units. I
think fixed precision int64 is probably OK (simplifying things quite a
bit), but the ns precision chosen by pandas was probably a mistake (not a
big enough range). The main advantage of using a single array for the
underlying data is that it's very straightforward to drop in a Cython or
Numba or whatever for performance critical steps.

In my mind, the main advantage of using floating point math is that NaT
(not a time) becomes much easier to represent and work with -- you can
simply map it to NaN. Handling NaT is a major source of complexity for the
datetime operations in pandas.
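
For example, with a floating point representation the missing value handling
comes for free (a sketch):

import numpy as np

# seconds since some epoch, with NaN standing in for NaT
times = np.array([0.0, 3600.0, np.nan])
deltas = times - times[0]  # the missing value propagates automatically
print(np.isnan(deltas))    # [False False  True]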

The other thing to consider is how much progress has been made on the
datetime dype in DyND, which is where the numpy replacement part of Blaze
has ended up. I know some sort of datetime object *has* been implemented,
though from my tests it does not really appear to be in fully working
condition at this point (e.g., there does not appear to be a corresponding
timedelta type):
https://github.com/libdynd/dynd-python

Stephan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add a function to broadcast arrays to a given shape to numpy's stride_tricks?

2015-01-03 Thread Stephan Hoyer
Here is an update on a new function for broadcasting arrays to a given
shape (now named np.broadcast_to).

I have a pull request up for review, which has received some feedback now:
https://github.com/numpy/numpy/pull/5371

There is still at least one design decision to settle: should we expose
broadcast_shape in the public API? In the current implementation, it is
exposed as a public function in numpy.lib.stride_tricks (like as_strided),
but it is not exported into the main numpy namespace. The alternatives
would be to either make it a private function (_broadcast_shape) or expose
it publicly (np.broadcast_shape).
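
For reference, the new function behaves like this (a sketch of the intended
API):

import numpy as np

x = np.array([1, 2, 3])
y = np.broadcast_to(x, (4, 3))  # a read-only view; no data is copied
print(y.shape)    # (4, 3)
print(y.strides)  # (0, 8) on a 64-bit platform -- stride 0 on the new axis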

Please do speak if you have any thoughts to share on the implementation,
either here or in the pull request.

Best,
Stephan


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add a function to broadcast arrays to a given shape to numpy's stride_tricks?

2014-12-12 Thread Stephan Hoyer
On Fri, Dec 12, 2014 at 5:48 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 np.broadcast is the Python object of the old iterator. It may be a better
 idea to write all of these functions using the new one, np.nditer:

 def common_shape(*args):
     return np.nditer(args).shape[::-1]  # Yes, you do need to reverse it!


Unfortunately, that version does not seem to do what I'm looking for:

def common_shape(*args):
    return np.nditer(args).shape[::-1]

x = np.empty((4,))
y = np.empty((2, 3, 4))
print(common_shape(x, y))

Outputs: (6, 4)

And in writing 'broadcast_to', rather than rewriting the broadcasting
 logic, you could check the compatibility of the shape with something like:

 np.nditer((arr,), itershape=shape)  # will raise ValueError if shapes
 incompatible

 After that, all that would be left is some prepending of zero strides, and
 some zeroing of strides of shape 1 dimensions before calling as_strided


Yes, that is a good idea.

Here is a gist with the latest version of this code (shortly to be turned
into a PR):
https://gist.github.com/shoyer/3e36af0a8196c82d4b42
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add a function to broadcast arrays to a given shape to numpy's stride_tricks?

2014-12-12 Thread Stephan Hoyer
On Fri, Dec 12, 2014 at 6:25 AM, Jaime Fernández del Río 
jaime.f...@gmail.com wrote:

 it seems that all the functionality that has been discussed are one-liners
 using nditer: do we need new functions, or better documentation?


I think there is utility in adding a new function or two (my inclination is
to expose broadcast_to in the public API, but leave common_shape in
stride_tricks). NumPy provides all the tools to write these in a few lines,
but you need to know some very deep details of the NumPy API (nditer and
strides).

I don't think more documentation would make this obvious -- certainly
nditer does not need a longer docstring! The best sort of documentation
would be more examples. If this is a recipe that many NumPy users would
use, including it in stride_tricks would also serve such an educational
purpose (reading stride_tricks is how I figured out how strides work).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Add a function to broadcast arrays to a given shape to numpy's stride_tricks?

2014-12-11 Thread Stephan Hoyer
On Thu, Dec 11, 2014 at 8:17 AM, Sebastian Berg sebast...@sipsolutions.net
wrote:

 One option
 would also be to have something like:

 np.common_shape(*arrays)
 np.broadcast_to(array, shape)
 # (though I would like many arrays too)

 and then broadcast_arrays could be implemented in terms of these two.


It looks like np.broadcast lets us write the common_shape function very
easily:

def common_shape(*args):
    return np.broadcast(*args).shape

And it's also very fast:
100 loops, best of 3: 1.04 µs per loop

So that does seem like a feasible refactor/simplification for
np.broadcast_arrays.

Sebastian -- if you're up for writing np.broadcast_to in C, that's great!
If you're not sure if you'll be able to get around to that in the near
future, I'll submit my PR with a Python implementation (which will have
tests that will be useful in any case).
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

