Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-14 Thread Lluís Vilanova
Stephan Hoyer writes:

> On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova  wrote:
>> Whenever we repr an array using 'S', we can instead show a unicode in py3.
>> That keeps the binary representation, but will always show the expected
>> result to users, and it's only a handful of lines added to dump_data().

>> If needed, I could easily add a bytes array to make the alternative
>> explicit (where py3 would repr the contents as b'foo').

>> This would only leave the less-common paths inconsistent across python
>> versions, which should not be a problem for most examples/doctests:

>> * A 'U' array will show u'foo' in py2 and 'foo' in py3.
>> * The new binary array will show 'foo' in py2 and b'foo' in py3 (that
>>   could also be patched on the repr code).
>> * A 'O' array will not be able to do any meaningful repr conversions.


>> A more complex alternative (and actually closer to what I'm proposing) is
>> to modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
>> string. It would have the binary compatibility, while being a unicode
>> string in practice.

> I'm afraid these are both also non-starters at this point. NumPy's string
> dtype corresponds to bytes on Python 3, and you can use it to store arbitrary
> binary values. Would it really be an improvement to change the repr, if the
> scalar value resulting from indexing is still bytes?


> The sanest approach is probably a new dtype for one-byte strings. We talked
> about this a few years ago, but nobody has implemented it yet:
> http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

From the ref manual, 'S' is a "(byte-)string", which (to me) should never have
non-printable characters. That's why I'm advocating "S" to be your proposed
one-byte strings, while a new "B" dtype is needed for arbitrary binary arrays.
This has the added benefit of making docstrings correct on both py2 and py3.

But I won't keep pushing for this; I understand the backwards-compatibility
issues mentioned before. Maybe "S" should just be deprecated, "s" (as the
one-byte strings) and "B" added instead, and all docstrings and tests changed to
"s".

In any case, after reading the whole thread, it's not clear to me what the
consensus is on what the solution should be (Chris's summary is the closest
thing to that).

Cheers,
  Lluis
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Chris Barker
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova 
wrote:

> Great, that's the type of info I wanted to get before going forward. I
> guess there's code relying on the binary representation of 'S' to do
> mmap's or access the array's raw contents. Is that right?


yes, there is a LOT of code, most of it third party, that relies on
particular binary representations of the numpy dtypes.
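
A minimal sketch of the kind of dependency meant here, assuming Python 3 and
a reasonably recent NumPy (the array contents are purely illustrative): each
'S3' element occupies exactly three bytes, padded with NULs, and other code
can reinterpret that buffer directly.

```python
import numpy as np

# Fixed-width byte strings: every element takes exactly 3 bytes,
# and shorter values are padded with NUL bytes.
a = np.array([b"ab", b"cde"], dtype="S3")
raw = a.tobytes()
print(raw)

# Code that knows the layout can reinterpret the raw buffer as the
# same dtype without any parsing step.
b = np.frombuffer(raw, dtype="S3")
print(b.tolist())
```

Changing the storage of 'S' would silently change what `raw` looks like,
which is the compatibility concern raised above.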

> There is a fundamental semantic difference between a string and a byte
> array, that's the core of the problem.
>

well yes. but they were mingled in py2, and the 'S' dtype is essentially a
py2 string. But in py3, it maps more closely with bytes than string --
though yes, not exactly either :-(

> Here's an alternative that only handles the repr.
>
> Whenever we repr an array using 'S', we can instead show a unicode in py3.
> That keeps the binary representation, but will always show the expected
> result to users, and it's only a handful of lines added to dump_data().
>

This would probably be more confusing than helpful -- if an 'S' object
converts to a bytes object, then its repr should show that.

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Stephan Hoyer
On Tue, Sep 13, 2016 at 11:05 AM, Lluís Vilanova 
wrote:

> Whenever we repr an array using 'S', we can instead show a unicode in py3.
> That keeps the binary representation, but will always show the expected
> result to users, and it's only a handful of lines added to dump_data().
>
> If needed, I could easily add a bytes array to make the alternative
> explicit (where py3 would repr the contents as b'foo').
>
> This would only leave the less-common paths inconsistent across python
> versions, which should not be a problem for most examples/doctests:
>
> * A 'U' array will show u'foo' in py2 and 'foo' in py3.
> * The new binary array will show 'foo' in py2 and b'foo' in py3 (that
>   could also be patched on the repr code).
> * A 'O' array will not be able to do any meaningful repr conversions.
>
>
> A more complex alternative (and actually closer to what I'm proposing) is
> to modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
> string. It would have the binary compatibility, while being a unicode
> string in practice.


I'm afraid these are both also non-starters at this point. NumPy's string
dtype corresponds to bytes on Python 3, and you can use it to store
arbitrary binary values. Would it really be an improvement to change the
repr, if the scalar value resulting from indexing is still bytes?
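
As a small illustration of that point, assuming Python 3 (the sample values
are made up): indexing an 'S' array hands back a bytes scalar, and the dtype
accepts values that are not text at all.

```python
import numpy as np

# The second element is arbitrary binary data, not readable text.
a = np.array([b"foo", b"\xff\x00\x01"], dtype="S3")

first = a[0]
print(isinstance(first, bytes))  # the scalar is a bytes subclass
print(isinstance(first, str))    # and not a (unicode) str

# Binary values survive a round trip through the array unchanged.
print(bytes(a[1]) == b"\xff\x00\x01")
```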

The sanest approach is probably a new dtype for one-byte strings. We talked
about this a few years ago, but nobody has implemented it yet:
http://numpy-discussion.scipy.narkive.com/3nqDu3Zk/a-one-byte-string-dtype

(normally I would link to the archives on scipy.org, but the certificate
for HTTPS has expired so you see a big error message right now...)
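
No such dtype exists in this thread, so the following is only a sketch of
the idea using existing tools: text is stored one byte per character by
encoding it as latin-1 into an ordinary 'S' array and decoding on the way
out. The choice of latin-1 and the sample words are assumptions made for
illustration.

```python
import numpy as np

text = np.array(["foo", "café"])           # 'U' dtype: 4 bytes per character
packed = np.char.encode(text, "latin-1")   # 'S' dtype: 1 byte per character
print(text.itemsize, packed.itemsize)

# Decoding with the same codec restores the original text.
restored = np.char.decode(packed, "latin-1")
print(restored.tolist() == text.tolist())
```

A real one-byte string dtype would do the encode/decode step implicitly;
here the caller has to remember which codec was used.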


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Chris Barker writes:

> We had a big long discussion about this on this list a while back (maybe 2 yrs
> ago???) please search the archives to find it. Though I'm pretty sure that we
> never did come to a conclusion. I think it started with wanting better support
> for unicode in loadtxt and the like, and ended up delving into other encodings
> for the 'U' dtype, and maybe a single byte string dtype (latin-1), or maybe a
> variable-size unicode object like Py3's, or...

> However, it is absolutely a non-starter to change the binary representation of
> the 'S' type in any version of numpy. Due to the legacy of py2 (and, indeed,
> most computing environments) 'S' is a single byte string representation. And the
> binary representation is often really key to numpy use.
> Period, end of story.

Great, that's the type of info I wanted to get before going forward. I guess
there's code relying on the binary representation of 'S' to do mmap's or access
the array's raw contents. Is that right?


> And that maps to a py2 string and py3 bytes object.

> py2 does, of course, have a Unicode object as well. If you want your code (and
> doctests, and ...) to be compatible, then you should probably go to Unicode
> strings everywhere. py3 now supports the u'string' no-op literal to make this
> easier.

> (though I guess the __repr__ won't tack on that 'u', which is going to be a
> problem for docstrings).

That's exactly the problem. Doing all examples and doctests with 'U' instead of
'S' will break it for py2 instead of py3.


> Note also that py3 has added more and more "string-like" support to the bytes
> object, so it's not too bad to go bytes-only.

There is a fundamental semantic difference between a string and a byte array,
that's the core of the problem.


Here's an alternative that only handles the repr. Separate fixes would be needed
for loadtxt's and genfromtxt's problems (Sebastian Berg briefly pointed at that,
but I'd like to know more).

Whenever we repr an array using 'S', we can instead show a unicode in py3. That
keeps the binary representation, but will always show the expected result to
users, and it's only a handful of lines added to dump_data().

If needed, I could easily add a bytes array to make the alternative explicit
(where py3 would repr the contents as b'foo').

This would only leave the less-common paths inconsistent across python versions,
which should not be a problem for most examples/doctests:

* A 'U' array will show u'foo' in py2 and 'foo' in py3.
* The new binary array will show 'foo' in py2 and b'foo' in py3 (that could also
  be patched on the repr code).
* A 'O' array will not be able to do any meaningful repr conversions.


A more complex alternative (and actually closer to what I'm proposing) is to
modify numpy in py3 to restrict 'S' to using 8-bit points in a unicode
string. It would have the binary compatibility, while being a unicode string in
practice.


Cheers,
  Lluis


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Chris Barker
We had a big long discussion about this on this list a while back (maybe 2
yrs ago???) please search the archives to find it. Though I'm pretty sure
that we never did come to a conclusion. I think it started with wanting
better support for unicode in loadtxt and the like, and ended up delving
into other encodings for the 'U' dtype, and maybe a single byte string
dtype (latin-1), or maybe a variable-size unicode object like Py3's, or...

However, it is absolutely a non-starter to change the binary representation
of the 'S' type in any version of numpy. Due to the legacy of py2 (and,
indeed, most computing environments) 'S' is a single byte string
representation. And the binary representation is often really key to numpy
use.
Period, end of story.

And that maps to a py2 string and py3 bytes object.

py2 does, of course, have a Unicode object as well. If you want your code
(and doctests, and ...) to be compatible, then you should probably go to
Unicode strings everywhere. py3 now supports the u'string' no-op literal to
make this easier.

(though I guess the __repr__ won't tack on that 'u', which is going to be a
problem for docstrings).
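
To make the docstring issue concrete, here is what that looks like under
Python 3 (a sketch only; under py2 the same repr would carry the u prefix):

```python
# The u prefix is accepted as a no-op literal in Python 3 ...
s = u"foo"
print(s == "foo")

# ... but repr() does not print it back, so a doctest written against
# py2 output (u'foo') will not match py3 output ('foo').
print(repr(s))
```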

Note also that py3 has added more and more "string-like" support to the
bytes object, so it's not too bad to go bytes-only.
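
A few examples of that string-like support, assuming Python 3.5 or later
(the %-formatting line relies on bytes interpolation, which arrived in 3.5):

```python
data = b"foo bar"

# Many str methods have bytes counterparts.
print(data.upper())
print(data.split())
print(data.startswith(b"foo"))

# printf-style formatting also works on bytes (Python 3.5+).
print(b"%s=%d" % (b"x", 1))
```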

-CHB


On Tue, Sep 13, 2016 at 7:21 AM, Lluís Vilanova  wrote:

> Sebastian Berg writes:
>
> > On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> >> Hi! I'm giving a shot to issue #3184 [1], based on the observation that
> >> the string dtype ('S') under python 3 uses byte arrays instead of unicode
> >> (the only readable string type in python 3).
> >>
> >> This brings two major problems:
> >>
> >> * numpy code has to go through loops to open and read files as binary
> >>   data to load text into a bytes array, and does not play well with users
> >>   providing string (unicode) arguments
> >>
> >> * the repr of these arrays shows strings as b'text' instead of 'text',
> >>   which breaks doctests of software built on numpy
> >>
> >> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
> >> and NPY_UNICODE).
> >>
> >> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> >> internal implementation) will provide the best backwards compatibility,
> >> but is more cumbersome to implement.
>
> > I am not sure how that can be possible. Those types are fundamentally
> > different in how they store their data. String types use one byte per
> > character, unicode types will use 4 bytes per character. You can maybe
> > default to unicode in more cases in python 3, but you cannot make them
> > identical internally.
>
> BTW, by identical I mean having two externally visible types, but a common
> implementation in python 3 (that of NPY_UNICODE).
>
> The as-sane but not backwards-compatible option (I'm asking if this is
> acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
> implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
> np.string_).
>
>
> Cheers,
>   Lluis
>



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Sebastian Berg writes:

> On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation that
>> the string dtype ('S') under python 3 uses byte arrays instead of unicode
>> (the only readable string type in python 3).
>>
>> This brings two major problems:
>>
>> * numpy code has to go through loops to open and read files as binary
>>   data to load text into a bytes array, and does not play well with users
>>   providing string (unicode) arguments
>>
>> * the repr of these arrays shows strings as b'text' instead of 'text',
>>   which breaks doctests of software built on numpy
>>
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
>> and NPY_UNICODE).
>>
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
>> internal implementation) will provide the best backwards compatibility,
>> but is more cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).

The as-sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
np.string_).


Cheers,
  Lluis


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Lluís Vilanova
Sebastian Berg writes:

> On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation that
>> the string dtype ('S') under python 3 uses byte arrays instead of unicode
>> (the only readable string type in python 3).
>>
>> This brings two major problems:
>>
>> * numpy code has to go through loops to open and read files as binary
>>   data to load text into a bytes array, and does not play well with users
>>   providing string (unicode) arguments
>>
>> * the repr of these arrays shows strings as b'text' instead of 'text',
>>   which breaks doctests of software built on numpy
>>
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
>> and NPY_UNICODE).
>>
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
>> internal implementation) will provide the best backwards compatibility,
>> but is more cumbersome to implement.

> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

> What about giving `np.loadtxt` an encoding kwarg or something along
> that line?

np.loadtxt and np.genfromtxt are already quite complex in handling the implicit
conversion to byte-array imposed by numpy's port to python 3, and still fail in
some corner cases.

This conversion is also inherently surprising to users, since what I'd get in
python 2:

  >>> np.array('foo', dtype='S')
  array('foo', dtype='|S3')

In python 3 gives me a surprising result (note the b prefix on the resulting string):

  >>> np.array('foo', dtype='S')
  array(b'foo', dtype='|S3')

It's not only surprising, but also breaks absolutely all the doctests I have
with arrays that contain strings (it even breaks numpy's examples).
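
For comparison, a short sketch under Python 3 of how the two dtypes print
and of one existing way to get a text repr without touching the stored
bytes. Decoding as ASCII is an assumption that fits this sample data only.

```python
import numpy as np

s = np.array(["foo"], dtype="S")   # stored as bytes
u = np.array(["foo"], dtype="U")   # stored as unicode code points

print(repr(s))   # elements print with a b prefix
print(repr(u))   # elements print as plain text

# An explicit decode produces a 'U' array whose repr has no prefix.
decoded = np.char.decode(s, "ascii")
print(repr(decoded))
```

This keeps the 'S' data intact, but it has to be applied by hand in every
example, which is the inconvenience being described.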

That's why adding an encoding kwarg (better than the current auto-magical
conversion to binary) won't solve my problems. The 'S' dtype will still be a
binary array, which shows up in the repr.


Since all strings in python 3 are unicode, I'm expecting "string" and "unicode"
arrays in numpy to be the same *and* show up as strings (e.g., 'foo' instead of
b'foo').

Yes, the difference between these types is in how they store their data. What
I'm proposing is to always use unicode in python 3.

If necessary, we can add a new dtype that lets users store raw byte arrays. By
making them explicitly byte arrays, that shouldn't raise any new surprises.


I already started doing the changes I described (as a result of the discussion
in #3184 [1]), but wanted to double-check with the list before getting deeper
into it.

[1] https://github.com/numpy/numpy/issues/3184


Cheers,
  Lluis


Re: [Numpy-discussion] String & unicode arrays vs text loading in python 3

2016-09-13 Thread Sebastian Berg
On Di, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> Hi! I'm giving a shot to issue #3184 [1], based on the observation that
> the string dtype ('S') under python 3 uses byte arrays instead of unicode
> (the only readable string type in python 3).
>
> This brings two major problems:
>
> * numpy code has to go through loops to open and read files as binary
>   data to load text into a bytes array, and does not play well with users
>   providing string (unicode) arguments
>
> * the repr of these arrays shows strings as b'text' instead of 'text',
>   which breaks doctests of software built on numpy
>
> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING
> and NPY_UNICODE).
>
> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> internal implementation) will provide the best backwards compatibility,
> but is more cumbersome to implement.

I am not sure how that can be possible. Those types are fundamentally
different in how they store their data. String types use one byte per
character, unicode types will use 4 bytes per character. You can maybe
default to unicode in more cases in python 3, but you cannot make them
identical internally.
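
The storage difference can be checked directly; a minimal sketch (the
three-character width is arbitrary):

```python
import numpy as np

# 'S3' holds three one-byte characters, 'U3' holds three 4-byte code points.
print(np.dtype("S3").itemsize)
print(np.dtype("U3").itemsize)

# The same text therefore produces buffers of different length.
s = np.array(["foo"], dtype="S3")
u = np.array(["foo"], dtype="U3")
print(len(s.tobytes()), len(u.tobytes()))
```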

What about giving `np.loadtxt` an encoding kwarg or something along
that line?
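
For reference, a sketch of what such a call can look like. It assumes
NumPy 1.14 or later, where np.loadtxt and np.genfromtxt accept an
`encoding` argument; the file contents and the temporary file are made up
for the example.

```python
import os
import tempfile

import numpy as np

# Write a small UTF-8 text file to load back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as fh:
    fh.write("foo\nbär\n")
    path = fh.name

try:
    # Decode the file as UTF-8 and ask for a unicode ('U') result.
    words = np.loadtxt(path, dtype=str, encoding="utf-8")
finally:
    os.remove(path)

print(words.dtype.kind)
print(words.tolist())
```

Note that this loads into a 'U' array; it does not change how an 'S' array
is stored or printed.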

- Sebastian


> 
> Is it acceptable to internally just translate all appearances of 'S'
> (NPY_STRING) to 'U' (NPY_UNICODE) and get rid of one of the two when
> running in
> python 3?
> 
> The main drawback I see is that dtype reprs would not always be as
> expected:
> 
>    # python 2
>    >>> np.array('foo', dtype='S')
>    array('foo', dtype='|S3')
> 
>    # python 3
>    >>> np.array('foo', dtype='S')
>    array('foo',
>  dtype=' 
> 
> [1] https://github.com/numpy/numpy/issues/3184
> 
> 
> Cheers,
>   Lluis
> 
