Re: [Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Travis Oliphant
There has been some discussion on the Numba mailing list as well about a
version of guvectorize that doesn't compile for testing and flexibility.

Having this be inside NumPy itself seems ideal.

-Travis


On Tue, Sep 13, 2016 at 12:59 PM, Stephan Hoyer  wrote:

> On Tue, Sep 13, 2016 at 10:39 AM, Nathan Goldbaum 
> wrote:
>
>> I'm curious whether you have a plan to deal with the python functional
>> call overhead. Numba gets around this by JIT-compiling python functions -
>> is there something analogous you can do in NumPy or will this always be
>> limited by the overhead of repeatedly calling a Python implementation of
>> the "core" operation?
>>
>
> I don't think there is any way to avoid this in NumPy proper, but that's
> OK (it's similar to the existing overhead of vectorize).
>
> Numba already has guvectorize (and its own version of vectorize as well),
> which does exactly this.
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>


-- 

*Travis Oliphant, PhD*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Chris Barker
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova 
wrote:

> Great, that's the type of info I wanted to get before going forward. I
> guess
> there's code relying on the binary representation of 'S' to do mmap's or
> access
> the array's raw contents. Is that right?


yes, there is a LOT of code, most of it third party, that relies on
particular binary representations of the numpy dtypes.

There is a fundamental semantic difference between a string and a byte
> array,
> that's the core of the problem.
>

Well, yes -- but they were mingled in py2, and the 'S' dtype is essentially a
py2 string. In py3, it maps more closely to bytes than to str --
though yes, not exactly either :-(

Here's an alternative that only handles the repr.
>


> Whenever we repr an array using 'S', we can instead show a unicode in py3.
> That
> keeps the binary representation, but will always show the expected result
> to
> users, and it's only a handful of lines added to dump_data().
>

This would probably be more confusing than helpful -- if an 'S' object
converts to a bytes object, then its repr should show that.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Stephan Hoyer
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova 
wrote:

> Whenever we repr an array using 'S', we can instead show a unicode in py3.
> That
> keeps the binary representation, but will always show the expected result
> to
> users, and it's only a handful of lines added to dump_data().
>
> If needed, I could easily add a bytes array to make the alternative
> explicit
> (where py3 would repr the contents as b'foo').
>
> This would only leave the less-common paths inconsistent across python
> versions,
> which should not be a problem for most examples/doctests:
>
> * A 'U' array will show u'foo' in py2 and 'foo' in py3.
> * The new binary array will show 'foo' in py2 and b'foo' in py3 (that
> could also
>   be patched in the repr code).
> * An 'O' array will not be able to do any meaningful repr conversions.
>
>
> A more complex alternative (and actually closer to what I'm proposing) is
> to
> modify numpy in py3 to restrict 'S' to using 8-bit code points in a unicode
> string. It would keep binary compatibility while being a unicode string in
> practice.


I'm afraid these are both also non-starters at this point. NumPy's string
dtype corresponds to bytes on Python 3, and you can use it to store
arbitrary binary values. Would it really be an improvement to change the
repr, if the scalar value resulting from indexing is still bytes?

The sanest approach is probably a new dtype for one-byte strings. We talked
about this a few years ago, but nobody has implemented it yet:
http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

(normally I would link to the archives on scipy.org, but the certificate
for HTTPS has expired so you see a big error message right now...)
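Until such a dtype exists, the explicit bridge between the two is an
element-wise decode (latin-1 here stands in for the one-byte encoding such a
dtype might assume):

```python
import numpy as np

b = np.array([b"foo", b"bar"], dtype="S3")
# np.char.decode produces a true unicode ('U') array; latin-1 maps each
# byte to one code point, so it cannot fail on arbitrary bytes.
u = np.char.decode(b, "latin-1")
# u is ['foo', 'bar'] with dtype '<U3'
```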


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Chris Barker writes:

> We had a big long discussion about this on this list a while back (maybe 2 yrs
> ago???) please search the archives to find it. Though I'm pretty sure that we
> never did come to a conclusion. I think it started with wanting better support
> for unicode in loadtxt and the like, and ended up delving into other encodings
> for the 'U' dtype, and maybe a single byte string dtype (latin-1), or maybe a
> variable-size unicode object like Py3's, or...

> However, it is absolutely a non-starter to change the binary representation of
> the 'S' type in any version of numpy. Due to the legacy of py2 (and, indeed,
> most computing environments) 'S' is a single byte string representation. And 
> the
> binary representation is often really key to numpy use.
> Period, end of story.

Great, that's the type of info I wanted to get before going forward. I guess
there's code relying on the binary representation of 'S' to do mmap's or access
the array's raw contents. Is that right?


> And that maps to a py2 string and py3 bytes object.

> py2 does, of course, have a Unicode object as well. If you want your code (and
> doctests, and ...) to be compatible, then you should probably go to Unicode
> strings everywhere. py3 now supports the u'string' no-op literal to make this
> easier.

> (though I guess the __repr__ won't tack on that 'u', which is going to be a
> problem for docstrings).

That's exactly the problem. Doing all examples and doctests with 'U' instead of
'S' will break it for py2 instead of py3.


> Note also that py3 has added more and more "string-like" support to the bytes
> object, so it's not too bad to go bytes-only.

There is a fundamental semantic difference between a string and a byte array,
that's the core of the problem.


Here's an alternative that only handles the repr. Separate fixes would be needed
for loadtxt's and genfromtxt's problems (Sebastian Berg briefly pointed at that,
but I'd like to know more).

Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().

If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').

This would only leave the less-common paths inconsistent across python versions,
which should not be a problem for most examples/doctests:

* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
  be patched in the repr code).
* An 'O' array will not be able to do any meaningful repr conversions.


A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit code points in a unicode
string. It would keep binary compatibility while being a unicode string in
practice.
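For what it's worth, a user-side approximation of the repr-only variant
already exists via print options; this only changes display, not storage (a
sketch, not the in-tree change I'm proposing):

```python
import numpy as np

a = np.array([b"foo", b"bar"], dtype="S3")
assert "b'foo'" in repr(a)  # default py3 repr shows the b prefix

# The 'numpystr' formatter key covers string and bytes scalars:
np.set_printoptions(formatter={"numpystr":
    lambda x: x.decode("latin-1") if isinstance(x, bytes) else str(x)})
assert "b'foo'" not in str(a)  # items now display as plain text
np.set_printoptions(formatter=None)  # restore the default
```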


Cheers,
  Lluis


Re: [Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Stephan Hoyer
On Tue, Sep 13, 2016 at 10:39 AM, Nathan Goldbaum 
wrote:

> I'm curious whether you have a plan to deal with the python functional
> call overhead. Numba gets around this by JIT-compiling python functions -
> is there something analogous you can do in NumPy or will this always be
> limited by the overhead of repeatedly calling a Python implementation of
> the "core" operation?
>

I don't think there is any way to avoid this in NumPy proper, but that's OK
(it's similar to the existing overhead of vectorize).
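To illustrate that overhead: np.vectorize invokes the Python callable once per
element (plus a probe call to infer the output type), so the inner loop never
leaves Python:

```python
import numpy as np

calls = {"n": 0}

def add_one(x):
    calls["n"] += 1  # count how often the Python "core" runs
    return x + 1

vadd = np.vectorize(add_one)
result = vadd(np.arange(5))
# add_one ran at least once per element of the input
```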

Numba already has guvectorize (and its own version of vectorize as well),
which does exactly this.


Re: [Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Nathan Goldbaum
On Tue, Sep 13, 2016 at 11:47 AM, Stephan Hoyer  wrote:

> NumPy has the handy np.vectorize for turning Python code that operates on
> scalars into a vectorized function that works like a ufunc, but no helper
> function for creating generalized ufuncs (http://docs.scipy.org/doc/
> numpy/reference/c-api.generalized-ufuncs.html).
>
> np.apply_along_axis accomplishes some of this, but it only allows a single
> core dimension on a single argument.
>
> So I propose adding a new object, np.guvectorize(pyfunc, signature,
> otypes, ...), where pyfunc is defined over the core dimensions only of any
> inputs and signature is any valid gufunc signature (a string). Calling this
> object would apply the gufunc. This is inspired by the similar
> numba.guvectorize, which is currently the easiest way to write a gufunc in
> Python.
>
> In addition to being handy like vectorize, such functionality would be
> especially useful for working with libraries that build upon NumPy to
> extend the capabilities of generalized ufuncs (e.g., xarray after
> https://github.com/pydata/xarray/pull/964).
>
>
First, this seems really cool. I hope it goes somewhere.

I'm curious whether you have a plan to deal with the python functional call
overhead. Numba gets around this by JIT-compiling python functions - is
there something analogous you can do in NumPy or will this always be
limited by the overhead of repeatedly calling a Python implementation of
the "core" operation?

-Nathan


> Cheers,
> Stephan
>


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Chris Barker
We had a big long discussion about this on this list a while back (maybe 2
yrs ago???) please search the archives to find it. Though I'm pretty sure
that we never did come to a conclusion. I think it started with wanting
better support for unicode in loadtxt and the like, and ended up delving
into other encodings for the 'U' dtype, and maybe a single byte string
dtype (latin-1), or maybe a variable-size unicode object like Py3's, or...

However, it is absolutely a non-starter to change the binary representation
of the 'S' type in any version of numpy. Due to the legacy of py2 (and,
indeed, most computing environments) 'S' is a single byte string
representation. And the binary representation is often really key to numpy
use.
Period, end of story.

And that maps to a py2 string and py3 bytes object.
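The storage contract is easy to verify: 'S' maps one byte per character
straight onto the buffer, while 'U' uses four bytes (UCS-4) per character:

```python
import numpy as np

# Buffer-oriented code relies on 'S' viewing raw bytes directly:
s = np.frombuffer(b"abcdef", dtype="S3")
# s is [b'abc', b'def']: two 3-byte items, no copy, no decoding

# The layouts of 'S' and 'U' differ by a factor of four:
assert np.dtype("S3").itemsize == 3
assert np.dtype("U3").itemsize == 12
```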

py2 does, of course, have a Unicode object as well. If you want your code
(and doctests, and ...) to be compatible, then you should probably go to
Unicode strings everywhere. py3 now supports the u'string' no-op literal to
make this easier.

(though I guess the __repr__ won't tack on that 'u', which is going to be a
problem for docstrings).

Note also that py3 has added more and more "string-like" support to the
bytes object, so it's not too bad to go bytes-only.

-CHB


On Tue, Sep 13, 2016 at 7:21 AM, Lluís Vilanova  wrote:

> Sebastian Berg writes:
>
> > On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> >> Hi! I'm giving a shot to issue #3184 [1], based on the observation
> >> that the
> >> string dtype ('S') under python 3 uses byte arrays instead of unicode
> >> (the only
> >> readable string type in python 3).
> >>
> >> This brings two major problems:
> >>
>> * numpy code has to jump through hoops to open and read files as binary
> >> data to
> >>   load text into a bytes array, and does not play well with users
> >> providing
> >>   string (unicode) arguments
> >>
> >> * the repr of these arrays shows strings as b'text' instead of
> >> 'text', which
> >>   breaks doctests of software built on numpy
> >>
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent
> >> (NPY_STRING and
> >> NPY_UNICODE).
> >>
> >> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> >> internal
> >> implementation) will provide the best backwards compatibility, but is
> >> more
> >> cumbersome to implement.
>
> > I am not sure how that can be possible. Those types are fundamentally
> > different in how they store their data. String types use one byte per
> > character, unicode types will use 4 bytes per character. You can maybe
> > default to unicode in more cases in python 3, but you cannot make them
> > identical internally.
>
> BTW, by identical I mean having two externally visible types, but a common
> implementation in python 3 (that of NPY_UNICODE).
>
> The equally sane but not backwards-compatible option (I'm asking if this is
> acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
> implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
> np.string_).
>
>
> Cheers,
>   Lluis





[Numpy-discussion] guvectorize, a helper for writing generalized ufuncs

2016-09-13 Thread Stephan Hoyer
NumPy has the handy np.vectorize for turning Python code that operates on
scalars into a vectorized function that works like a ufunc, but no helper
function for creating generalized ufuncs (
http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html).

np.apply_along_axis accomplishes some of this, but it only allows a single
core dimension on a single argument.

So I propose adding a new object, np.guvectorize(pyfunc, signature, otypes,
...), where pyfunc is defined over the core dimensions only of any inputs
and signature is any valid gufunc signature (a string). Calling this object
would apply the gufunc. This is inspired by the similar numba.guvectorize,
which is currently the easiest way to write a gufunc in Python.

In addition to being handy like vectorize, such functionality would be
especially useful for working with libraries that build upon NumPy to
extend the capabilities of generalized ufuncs (e.g., xarray after
https://github.com/pydata/xarray/pull/964).
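To make the idea concrete, here is a rough sketch of the simplest case,
signature '(n)->()', built on np.apply_along_axis (the name and wrapper are
purely illustrative, not a proposed implementation):

```python
import numpy as np

def guvectorize_n_to_scalar(pyfunc):
    """Illustrative stand-in for np.guvectorize(pyfunc, '(n)->()').

    pyfunc sees only the core dimension (the last axis); all leading
    axes are treated as loop dimensions.
    """
    def wrapped(x):
        return np.apply_along_axis(pyfunc, -1, np.asarray(x))
    return wrapped

row_mean = guvectorize_n_to_scalar(lambda v: v.mean())
out = row_mean(np.arange(6.0).reshape(2, 3))
# out has shape (2,): one result per row of the (2, 3) input
```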

Cheers,
Stephan


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Sebastian Berg writes:

> On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation
>> that the
>> string dtype ('S') under python 3 uses byte arrays instead of unicode
>> (the only
>> readable string type in python 3).
>> 
>> This brings two major problems:
>> 
>> * numpy code has to jump through hoops to open and read files as binary
>> data to
>>   load text into a bytes array, and does not play well with users
>> providing
>>   string (unicode) arguments
>> 
>> * the repr of these arrays shows strings as b'text' instead of
>> 'text', which
>>   breaks doctests of software built on numpy
>> 
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent
>> (NPY_STRING and
>> NPY_UNICODE).
>> 
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
>> internal
>> implementation) will provide the best backwards compatibility, but is
>> more
>> cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).

The equally sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
np.string_).


Cheers,
  Lluis


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Sebastian Berg writes:

> On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation
>> that the
>> string dtype ('S') under python 3 uses byte arrays instead of unicode
>> (the only
>> readable string type in python 3).
>> 
>> This brings two major problems:
>> 
>> * numpy code has to jump through hoops to open and read files as binary
>> data to
>>   load text into a bytes array, and does not play well with users
>> providing
>>   string (unicode) arguments
>> 
>> * the repr of these arrays shows strings as b'text' instead of
>> 'text', which
>>   breaks doctests of software built on numpy
>> 
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent
>> (NPY_STRING and
>> NPY_UNICODE).
>> 
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
>> internal
>> implementation) will provide the best backwards compatibility, but is
>> more
>> cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

> What about giving `np.loadtxt` an encoding kwarg or something along
> that line?

np.loadtxt and np.genfromtxt are already quite complex in handling the implicit
conversion to byte-array imposed by numpy's port to python 3, and still fail in
some corner cases.

This conversion is also inherently surprising to users, since what I'd get in
python 2:

  >>> np.array('foo', dtype='S')
  array('foo', dtype='|S3')

In python 3 I instead get a surprising result (note the b prefix on the string):

  >>> np.array('foo', dtype='S')
  array(b'foo', dtype='|S3')

It's not only surprising, but also breaks absolutely all the doctests I have
with arrays that contain strings (it even breaks numpy's examples).

That's why adding an encoding kwarg (better than the current auto-magical
conversion to binary) won't solve my problems. The 'S' dtype will still be a
binary array, which shows up in the repr.
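The mismatch is easy to demonstrate; this is exactly what breaks doctests:

```python
import numpy as np

a = np.array("foo", dtype="S")   # 'S' stores bytes on python 3
assert a.dtype == np.dtype("S3")
assert a[()] == b"foo"           # indexing yields bytes, not str
assert "b'foo'" in repr(a)       # ...and the b prefix leaks into doctests

u = np.array("foo", dtype="U")   # 'U' is the readable text dtype
assert u[()] == "foo"
```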


Since all strings in python 3 are unicode, I'm expecting "string" and "unicode"
arrays in numpy to be the same *and* show up as strings (e.g., 'foo' instead of
b'foo').

Yes, the difference between these types is in how they store their data. What
I'm proposing is to always use unicode in python 3.

If necessary, we can add a new dtype that lets users store raw byte arrays. By
making them explicitly byte arrays, that shouldn't raise any new surprises.


I already started doing the changes I described (as a result from the discussion
in #3184 [1]), but wanted to double-check with the list before getting deeper
into it.

[1] https://github.com/numpy/numpy/issues/3184


Cheers,
  Lluis


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Sebastian Berg
On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> Hi! I'm giving a shot to issue #3184 [1], based on the observation
> that the
> string dtype ('S') under python 3 uses byte arrays instead of unicode
> (the only
> readable string type in python 3).
> 
> This brings two major problems:
> 
> * numpy code has to jump through hoops to open and read files as binary
> data to
>   load text into a bytes array, and does not play well with users
> providing
>   string (unicode) arguments
> 
> * the repr of these arrays shows strings as b'text' instead of
> 'text', which
>   breaks doctests of software built on numpy
> 
> What I'm trying to do is make dtypes 'S' and 'U' equivalent
> (NPY_STRING and
> NPY_UNICODE).
> 
> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> internal
> implementation) will provide the best backwards compatibility, but is
> more
> cumbersome to implement.

I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.

What about giving `np.loadtxt` an encoding kwarg or something along
that line?
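Roughly like this (hypothetical usage; the exact keyword is the open question
here):

```python
import io
import numpy as np

# Hypothetical: an `encoding` keyword on np.loadtxt would hand decoded
# text to the parser, so string columns come back as readable 'U' data.
text = u"alpha 1.0\nbeta 2.0\n"
names = np.loadtxt(io.StringIO(text), dtype=str, usecols=0,
                   encoding="utf-8")
values = np.loadtxt(io.StringIO(text), usecols=1, encoding="utf-8")
# names is ['alpha', 'beta'] (a 'U' array), values is [1.0, 2.0]
```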

- Sebastian


> 
> Is it acceptable to internally just translate all appearances of 'S'
> (NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when
> running in
> python 3?
> 
> The main drawback I see is that dtype reprs would not always be as
> expected:
> 
>    # python 2
>    >>> np.array('foo', dtype='S')
>    array('foo',
>  dtype='|S3')
> 
>    # python 3
>    >>> np.array('foo', dtype='S')
>    array('foo',
>  dtype='<U3')
> 
> [1] https://github.com/numpy/numpy/issues/3184
> 
> 
> Cheers,
>   Lluis



[Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
string dtype ('S') under python 3 uses byte arrays instead of unicode (the only
readable string type in python 3).

This brings two major problems:

* numpy code has to jump through hoops to open and read files as binary data to
  load text into a bytes array, and does not play well with users providing
  string (unicode) arguments

* the repr of these arrays shows strings as b'text' instead of 'text', which
  breaks doctests of software built on numpy

What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
NPY_UNICODE).

Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
implementation) will provide the best backwards compatibility, but is more
cumbersome to implement.

Is it acceptable to internally just translate all appearances of 'S'
(NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when running in
python 3?
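Per element, that translation is just the cast numpy already supports:

```python
import numpy as np

s = np.array(["foo"], dtype="S")  # stored as bytes under python 3
u = s.astype("U3")                # the 'S' -> 'U' translation, element-wise
assert s[0] == b"foo"
assert u[0] == "foo"
assert u.dtype == np.dtype("U3")
```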

The main drawback I see is that dtype reprs would not always be as expected:

   # python 2
   >>> np.array('foo', dtype='S')
   array('foo',
 dtype='|S3')

   # python 3
   >>> np.array('foo', dtype='S')
   array('foo',
 dtype='<U3')