Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker - NOAA Federal
On Jan 21, 2014, at 4:58 PM, David Goldsmith  wrote:

>
> OK, well that's definitely beyond my level of expertise.

Well, it's in github--now's as good a time as any to learn github
collaboration...

- Fork the numpy source.

- Create a new file in:
  numpy/doc/neps

- Point folks to it here so they can comment, etc.

At some point, issue a pull request, and it can get merged into the
main source for final polishing...

-Chris

>
> DG
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
Date: Tue, 21 Jan 2014 19:20:12 +

> From: Robert Kern 
> Subject: Re: [Numpy-discussion] A one-byte string dtype?
>


> The wiki is frozen. Please do not add anything to it. It plays no role in
> our current development workflow. Drafting a NEP or two and iterating on
> them would be the next step.
>
> --
> Robert Kern
>

OK, well that's definitely beyond my level of expertise.

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Robert Kern
On Tue, Jan 21, 2014 at 6:34 PM, David Goldsmith 
wrote:

>> I can certainly get one started (but I don't think I can faithfully
>> summarize all this thread's current content, so I apologize in advance
for
>> leaving that undone).
>>
>> DG
>
> OK, I'm "lost" already: is there general agreement that this should
"jump" straight to one or more NEPs?  If not (or if there should be a Wiki
page for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site?  If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login?  I guess starting such a page is not as trivial as I had
assumed.

The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.

--
Robert Kern


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
On Tue, Jan 21, 2014 at 10:00 AM, wrote:

> Date: Tue, 21 Jan 2014 09:53:25 -0800
> From: David Goldsmith 
> Subject: Re: [Numpy-discussion] A one-byte string dtype?
> To: numpy-discussion@scipy.org
>
> > Date: Tue, 21 Jan 2014 17:35:26 +
> > From: Nathaniel Smith 
> > Subject: Re: [Numpy-discussion] A one-byte string dtype?
> > To: Discussion of Numerical Python 
> >
> > On 21 Jan 2014 17:28, "David Goldsmith"  wrote:
> > >
> > >
> > > Am I the only one who feels that this (very important--I'm being
> sincere,
> > not sarcastic) thread has matured and specialized enough to warrant its
> > own home on the Wiki?
> >
> > Sounds plausible, perhaps you could write up such a page?
> >
> > -n
> >
>
> I can certainly get one started (but I don't think I can faithfully
> summarize all this thread's current content, so I apologize in advance for
> leaving that undone).
>
> DG
>

OK, I'm "lost" already: is there general agreement that this should "jump"
straight to one or more NEPs?  If not (or if there should be a Wiki page
for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site?  If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login?  I guess starting such a page is not as trivial as I had
assumed.

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker wrote:

> A lot of good discussion here -- too much to comment individually, but it
> seems we can boil it down to a couple somewhat distinct proposals:
>
> 1) a one-byte-per-char dtype:
>
> This would provide compact, high efficiency storage for common text
> for scientific computing. It is analogous to a lower-precision numeric type
> -- i.e. it could not store any unicode strings -- only the subset that are
> compatible with the suggested encoding.
>  Suggested encoding: latin-1
>  Other options:
>  - ascii only.
>  - settable to any one-byte per char encoding supported by python
> I like this IFF it's pretty easy, but it may
> add significant complications (and overhead) for comparisons, etc
>
> NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
> back to the py2 mojibake hell" -- the goal here is to very clearly have
> this be text data, and have a clearly defined encoding. Which is why we
> can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
> to conveniently and efficiently use numpy for text that is ansi compatible.
>
> 2) a utf-8 dtype:
> NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte
> per char encoding, so would not fit snugly into the numpy data model.
>It would give compact memory use for mostly-ascii data, so that would
> be nice.
>
> 3) a fully python-3 like ( PEP 393 ) flexible unicode dtype.
>   This would get us the advantages of the new py3 unicode model -- compact
> and efficient when it can be, but also supporting all of unicode. Honestly,
> this seems like more work than it's worth to me, at least given the current
> numpy dtype model -- maybe a nice addition to dynd. You can, after
> all, simply use an object array with py3 strings in it. Though perhaps
> using the py3 unicode type, but having a dtype that specifically links to
> that, rather than a generic python object would be a good compromise.
>
>
> Hmm -- I guess despite what I said, I just wrote the starting point for a
> NEP...
>
>
Should also mention the reasons for adding a new data type.



Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker
A lot of good discussion here -- too much to comment individually, but it
seems we can boil it down to a couple somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high efficiency storage for common text
for scientific computing. It is analogous to a lower-precision numeric type
-- i.e. it could not store any unicode strings -- only the subset that are
compatible with the suggested encoding.
 Suggested encoding: latin-1
 Other options:
 - ascii only.
 - settable to any one-byte per char encoding supported by python
I like this IFF it's pretty easy, but it may
add significant complications (and overhead) for comparisons, etc

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
back to the py2 mojibake hell" -- the goal here is to very clearly have
this be text data, and have a clearly defined encoding. Which is why we
can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
to conveniently and efficiently use numpy for text that is ansi compatible.

2) a utf-8 dtype:
NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte
per char encoding, so would not fit snugly into the numpy data model.
   It would give compact memory use for mostly-ascii data, so that would be
nice.

3) a fully python-3 like ( PEP 393 ) flexible unicode dtype.
  This would get us the advantages of the new py3 unicode model -- compact
and efficient when it can be, but also supporting all of unicode. Honestly,
this seems like more work than it's worth to me, at least given the current
numpy dtype model -- maybe a nice addition to dynd. You can, after
all, simply use an object array with py3 strings in it. Though perhaps
using the py3 unicode type, but having a dtype that specifically links to
that, rather than a generic python object would be a good compromise.


Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...

(or two, actually...)
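The core claim of proposal (1) -- that latin-1 gives exactly one byte per
character for the strings it can represent -- can be sketched in plain
Python. The `fits_one_byte_dtype` helper below is purely illustrative, not
part of any proposed API:

```python
# Latin-1 maps the first 256 Unicode code points one-to-one onto single
# bytes, so a fixed-width one-byte-per-char cell can hold exactly those
# strings losslessly -- analogous to a lower-precision numeric type.

def fits_one_byte_dtype(s):
    """True if `s` is representable in a latin-1 backed array cell."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

assert fits_one_byte_dtype(u"\xd5scar")    # code point 0xD5 < 256
assert not fits_one_byte_dtype(u"\u03a9")  # GREEK CAPITAL OMEGA > 0xFF
# One byte per character, unlike utf-8, which needs two bytes for \xd5:
assert len(u"\xd5scar".encode("latin-1")) == 5
assert len(u"\xd5scar".encode("utf-8")) == 6
```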

-Chris


On Tue, Jan 21, 2014 at 9:46 AM, Chris Barker  wrote:

> On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith 
> wrote:
>
>>
>> Am I the only one who feels that this (very important--I'm being sincere,
>> not sarcastic) thread has matured and specialized enough to warrant its
>> own home on the Wiki?
>>
>
> Or  maybe a NEP?
>
> https://github.com/numpy/numpy/tree/master/doc/neps
>
> sorry -- really swamped this week, so I won't be writing it...
>
> -Chris
>



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
> Date: Tue, 21 Jan 2014 17:35:26 +
> From: Nathaniel Smith 
> Subject: Re: [Numpy-discussion] A one-byte string dtype?
> To: Discussion of Numerical Python 
>
> On 21 Jan 2014 17:28, "David Goldsmith"  wrote:
> >
> >
> > Am I the only one who feels that this (very important--I'm being sincere,
> not sarcastic) thread has matured and specialized enough to warrant its
> own home on the Wiki?
>
> Sounds plausible, perhaps you could write up such a page?
>
> -n
>

I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker
On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith wrote:

>
> Am I the only one who feels that this (very important--I'm being sincere,
> not sarcastic) thread has matured and specialized enough to warrant its
> own home on the Wiki?
>

Or  maybe a NEP?

https://github.com/numpy/numpy/tree/master/doc/neps

sorry -- really swamped this week, so I won't be writing it...

-Chris



Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Nathaniel Smith
On 21 Jan 2014 17:28, "David Goldsmith"  wrote:
>
>
> Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant its
own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant its
own home on the Wiki?

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Sebastian Berg
On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:

>
>
>
> On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
>>
>>
>>
>> On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <
>> aldcr...@head.cfa.harvard.edu> wrote:
>>
>>>
>>>
>>>
>>> On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <
>>> charlesr.har...@gmail.com> wrote:
>>>



 On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <
 charlesr.har...@gmail.com> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith wrote:
>
>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>  wrote:
>> >
>> >
>> >
>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <
>> oscar.j.benja...@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" <
>> charlesr.har...@gmail.com>
>> >> wrote:
>> >> >
>> >> > I think we may want something like PEP 393. The S datatype may
>> be the
>> >> > wrong place to look, we might want a modification of U instead
>> so as to
>> >> > transparently get the benefit of python strings.
>> >>
>> >> The approach taken in PEP 393 (the FSR) makes more sense for str
>> than it
>> >> does for numpy arrays for two reasons: str is immutable and opaque.
>> >>
>> >> Since str is immutable the maximum code point in the string can be
>> >> determined once when the string is created before anything else
>> can get a
>> >> pointer to the string buffer.
>> >>
>> >> Since it is opaque no one can rightly expect it to expose a
>> particular
>> >> binary format so it is free to choose without compromising any
>> expected
>> >> semantics.
>> >>
>> >> If someone can call buffer on an array then the FSR is a semantic
>> change.
>> >>
>> >> If a numpy 'U' array used the FSR and consisted only of ASCII
>> characters
>> >> then it would have a one byte per char buffer. What then happens
>> if you put
>> >> a higher code point in? The buffer needs to be resized and the
>> data copied
>> >> over. But then what happens to any buffer objects or array views?
>> They would
>> >> be pointing at the old buffer from before the resize. Subsequent
>> >> modifications to the resized array would not show up in other
>> views and vice
>> >> versa.
>> >>
>> >> I don't think that this can be done transparently since users of a
>> numpy
>> >> array need to know about the binary representation. That's why I
>> suggest a
>> >> dtype that has an encoding. Only in that way can it consistently
>> have both a
>> >> binary and a text interface.
>> >
>> >
>> > I didn't say we should change the S type, but that we should have
>> something,
>> > say 's', that appeared to python as a string. I think if we want
>> transparent
>> > string interoperability with python together with a compressed
>> > representation, and I think we need both, we are going to have to
>> deal with
>> > the difficulties of utf-8. That means raising errors if the string
>> doesn't
>> > fit in the allotted size, etc. Mind, this is a workaround for the
>> mass of
>> > ascii data that is already out there, not a substitute for 'U'.
>>
>> If we're going to be taking that much trouble, I'd suggest going ahead
>> and adding a variable-length string type (where the array itself
>> contains a pointer to a lookaside buffer, maybe with an optimization
>> for stashing short strings directly). The fixed-length requirement is
>> pretty onerous for lots of applications (e.g., pandas always uses
>> dtype="O" for strings -- and that might be a good workaround for some
>> people in this thread for now). The use of a lookaside buffer would
>> also make it practical to resize the buffer when the maximum code
>> point changed, for that matter...
>>
>
 The more I think about it, the more I think we may need to do that.
 Note that dynd has ragged arrays and I think they are implemented as
 pointers to buffers. The easy way for us to do that would be a
 specialization of object arrays to string types only as you suggest.

>>>
>>> Is this approach intended to be in *addition to* the latin-1 "s" type
>>> originally proposed by Chris, or *instead of* that?
>>>
>>>
>> Well, that's open for discussion. The problem is to have something that
>> is both compact (latin-1) and interoperates transparently with python 3
>> strings (utf-8). A latin-1 type would be easier to implement and would
>> probably be a better choice for something available in both python 2 and
>> python 3, but unless the python 3 developers come up with something clever
>> I don't  see how to make it behave transparently as a string in python 3.
>> OTOH, it's not clear to me how to make utf-8 operate transparently with
>> python 2 strings, especially as the unicode representation choices in
>> python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
>> is unlikely to be backported. The problem may be unsolvable in a completely
>> satisfactory way.

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote:
>
> Well, that's open for discussion. The problem is to have something that is
> both compact (latin-1) and interoperates transparently with python 3
> strings (utf-8). A latin-1 type would be easier to implement and would
> probably be a better choice for something available in both python 2 and
> python 3, but unless the python 3 developers come up with something clever
> I don't  see how to make it behave transparently as a string in python 3.
> OTOH, it's not clear to me how to make utf-8 operate transparently with
> python 2 strings, especially as the unicode representation choices in
> python 2 are ucs-2 or ucs-4

On Python 2, unicode strings can operate transparently with byte strings:

$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'],
      dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0])  # Encodes as UTF-8
Õscar
>>> 'My name is %s' % a[0]  # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0])  # Encodes as UTF-8
My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new
dtype always returns a unicode string under Py2 it should work (as well as the
Py2 text model ever does).

> and the python 3 work adding utf-16 and utf-8
> is unlikely to be backported. The problem may be unsolvable in a completely
> satisfactory way.

What do you mean by this? PEP 393 uses UCS-1/2/4 not utf-8/16/32 i.e. it
always uses a fixed-width encoding.

You can just use the CPython C-API to create the unicode strings. The simplest
way is probably to use utf-8 internally and then call PyUnicode_DecodeUTF8 and
PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x
and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds
and post-3.3 FSR formats.
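As a rough illustration of that fixed-width behaviour (on CPython 3.3+;
exact object header sizes vary across versions and builds, so only relative
growth is compared here):

```python
import sys

# PEP 393 stores each str in the narrowest fixed-width form that holds its
# largest code point: 1, 2, or 4 bytes per character (UCS-1/2/4), chosen
# once at creation time -- not a variable-width encoding like utf-8.
ucs1 = "\xd5" * 1000        # max code point < 0x100   -> 1 byte/char
ucs2 = "\u03a9" * 1000      # max code point < 0x10000 -> 2 bytes/char
ucs4 = "\U0001f600" * 1000  # beyond the BMP           -> 4 bytes/char

assert sys.getsizeof(ucs2) > sys.getsizeof(ucs1) + 500  # ~1 extra byte/char
assert sys.getsizeof(ucs4) > sys.getsizeof(ucs2) + 500  # ~2 extra bytes/char
```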

Unlike Python's str there isn't much need to be able to efficiently slice or
index within the string array element. Indexing into the array to get the
string requires creating a new object, so you may as well just decode from
utf-8 at that point [it's big-O(num chars) either way]. There's no need to
constrain it to fixed-width encodings like the FSR, in which case utf-8 is
clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).
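A Python-level model of this boundary scheme, assuming nothing beyond the
existing 'S' (bytes) dtype -- the `store`/`load` helpers are hypothetical
stand-ins for what a real utf-8 dtype would do in C:

```python
import numpy as np

# Keep utf-8 bytes in a fixed-width bytes array; decode to str only when an
# element crosses the array boundary. The encoding check on store models the
# "string doesn't fit in the allotted size" errors discussed above.

def store(strings, width):
    data = [s.encode("utf-8") for s in strings]
    if max(len(b) for b in data) > width:
        raise ValueError("encoded string does not fit in the allotted size")
    return np.array(data, dtype="S%d" % width)

def load(arr, i):
    # O(num chars) either way: element access already creates a new object.
    return arr[i].decode("utf-8")

a = store([u"hello", u"\xd5scar", u"\u03a9"], width=8)
assert a.itemsize == 8  # fixed width in *bytes*, not characters
assert load(a, 1) == u"\xd5scar"
```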


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Aldcroft, Thomas
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris  wrote:


>>> The more I think about it, the more I think we may need to do that. Note
>>> that dynd has ragged arrays and I think they are implemented as pointers to
>>> buffers. The easy way for us to do that would be a specialization of object
>>> arrays to string types only as you suggest.
>>>
>>
>> Is this approach intended to be in *addition to* the latin-1 "s" type
>> originally proposed by Chris, or *instead of* that?
>>
>>
> Well, that's open for discussion. The problem is to have something that is
> both compact (latin-1) and interoperates transparently with python 3
> strings (utf-8). A latin-1 type would be easier to implement and would
> probably be a better choice for something available in both python 2 and
> python 3, but unless the python 3 developers come up with something clever
> I don't  see how to make it behave transparently as a string in python 3.
> OTOH, it's not clear to me how to make utf-8 operate transparently with
> python 2 strings, especially as the unicode representation choices in
> python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
> is unlikely to be backported. The problem may be unsolvable in a completely
> satisfactory way.

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:


>>>
>> The more I think about it, the more I think we may need to do that. Note
>> that dynd has ragged arrays and I think they are implemented as pointers to
>> buffers. The easy way for us to do that would be a specialization of object
>> arrays to string types only as you suggest.
>>
>
> Is this approach intended to be in *addition to* the latin-1 "s" type
> originally proposed by Chris, or *instead of* that?
>
>
Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't  see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.

Chuck

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Aldcroft, Thomas
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris  wrote:

>>> people in this thread for now). The use of a lookaside buffer would
>>> also make it practical to resize the buffer when the maximum code
>>> point changed, for that matter...
>>>
>>
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.
>

Is this approach intended to be in *addition to* the latin-1 "s" type
originally proposed by Chris, or *instead of* that?

- Tom


>
> 
>
> Chuck
>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Tue, Jan 21, 2014 at 11:41:30AM +, Nathaniel Smith wrote:
> On 21 Jan 2014 11:13, "Oscar Benjamin"  wrote:
> > If the Numpy array would manage the buffers itself then that per string
> memory
> > overhead would be eliminated in exchange for an 8 byte pointer and at
> least 1
> > byte to represent the length of the string (assuming you can somehow use
> > Pascal strings when short enough - null bytes cannot be used). This gives
> an
> > overhead of 9 bytes per string (or 5 on 32 bit). In this case you save
> memory
> > if the strings are more than 3 characters long and you get at least a 50%
> > saving for strings longer than 9 characters.
> 
> There are various optimisations possible as well.
> 
> For ASCII strings of up to length 8, one could also use tagged pointers to
> eliminate the lookaside buffer entirely. (Alignment rules mean that
> pointers to allocated buffers always have the low bits zero; so you can
> make a rule that if the low bit is set to one, then this means the
> "pointer" itself should be interpreted as containing the string data; use
> the spare bit in the other bytes to encode the length.)
> 
> In some cases it may also make sense to let identical strings share
> buffers, though this adds some overhead for reference counting and
> interning.

Would this new dtype have an opaque memory representation? What would happen
in the following:

>>> a = numpy.array(['CGA', 'GAT'], dtype='s')

>>> memoryview(a)

>>> with open('file', 'wb') as fout:
... a.tofile(fout)

>>> with open('file', 'rb') as fin:
... a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a
text file? Or would you just need to use fromiter:

>>> with open('file', encoding='utf-8') as fin:
... a = numpy.fromiter(fin, dtype='s')

>>> with open('file', 'w', encoding='utf-8') as fout:
... fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines)
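[For reference, the boundary pattern mentioned earlier in the thread (bytes on disk, text at the Python level) can already be written with the existing 'S'/'U' dtypes and the `np.char` helpers; a sketch, with purely illustrative 3-char strings:]

```python
import numpy as np

# Store/read raw bytes ('S'); decode/encode only at the boundaries.
raw = np.array([b"CGA", b"GAT"], dtype="S3")

text = np.char.decode(raw, "ascii")   # 'U' array for Python-level work
assert text.dtype.kind == "U"
assert text[0] == "CGA"

back = np.char.encode(text, "ascii")  # back to bytes for tofile() etc.
assert back.dtype.kind == "S"
assert back[1] == b"GAT"
```

The 4x memory cost of the intermediate 'U' array is exactly what the proposed one-byte dtype would avoid.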

Would it be less confusing to use dtype='u' rather than dtype='U', to signify
that it is an optimised form of the 'U' dtype as far as access from Python
code is concerned? Calling it 's' only really makes sense if there is a plan
to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as
well?


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Nathaniel Smith
On 21 Jan 2014 11:13, "Oscar Benjamin"  wrote:
> If the Numpy array would manage the buffers itself then that per string
memory
> overhead would be eliminated in exchange for an 8 byte pointer and at
least 1
> byte to represent the length of the string (assuming you can somehow use
> Pascal strings when short enough - null bytes cannot be used). This gives
an
> overhead of 9 bytes per string (or 5 on 32 bit). In this case you save
memory
> if the strings are more than 3 characters long and you get at least a 50%
> saving for strings longer than 9 characters.

There are various optimisations possible as well.

For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.
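[The tagged-pointer idea can be illustrated in pure Python by packing a short ASCII string into a 64-bit integer whose low bit is set; a sketch only, since a real implementation would live in C:]

```python
def pack(s):
    """Pack an ASCII string of up to 7 chars into a tagged 64-bit word."""
    data = s.encode("ascii")
    assert len(data) <= 7
    word = 1 | (len(data) << 1)            # tag bit + length in the low byte
    for i, b in enumerate(data):
        word |= b << (8 * (i + 1))         # characters in the upper 7 bytes
    return word

def unpack(word):
    assert word & 1                        # low bit set => inline string
    n = (word >> 1) & 0x7F
    return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(n)).decode("ascii")

assert unpack(pack("ACGT")) == "ACGT"
assert pack("ACGT") & 1 == 1    # an aligned pointer always has bit 0 clear
```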

-n


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
> On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris  wrote:
> > On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith  wrote:
> >> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris 
> >>  wrote:
> >> >
> >> > I didn't say we should change the S type, but that we should have
> >> something,
> >> > say 's', that appeared to python as a string. I think if we want
> >> transparent
> >> > string interoperability with python together with a compressed
> >> > representation, and I think we need both, we are going to have to deal
> >> with
> >> > the difficulties of utf-8. That means raising errors if the string
> >> doesn't
> >> > fit in the allotted size, etc. Mind, this is a workaround for the mass
> >> of
> >> > ascii data that is already out there, not a substitute for 'U'.
> >>
> >> If we're going to be taking that much trouble, I'd suggest going ahead
> >> and adding a variable-length string type (where the array itself
> >> contains a pointer to a lookaside buffer, maybe with an optimization
> >> for stashing short strings directly). The fixed-length requirement is
> >> pretty onerous for lots of applications (e.g., pandas always uses
> >> dtype="O" for strings -- and that might be a good workaround for some
> >> people in this thread for now). The use of a lookaside buffer would
> >> also make it practical to resize the buffer when the maximum code
> >> point changed, for that matter...
> >>
> The more I think about it, the more I think we may need to do that. Note
> that dynd has ragged arrays and I think they are implemented as pointers to
> buffers. The easy way for us to do that would be a specialization of object
> arrays to string types only as you suggest.

This wouldn't necessarily help for the gigarows of short text strings use case
(depending on what "short" means). Also even if it technically saves memory
you may have a greater overhead from fragmenting your array all over the heap.

On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory
saving over dtype='U' only if the strings are 17 characters or more. To get a
50% saving over dtype='U' you'd need strings of at least 49 characters.

If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.
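[The figures above can be checked on a given interpreter; they are CPython implementation details and may differ between builds:]

```python
import sys

overhead = sys.getsizeof("")               # 49 on 64-bit CPython 3.3+
s = "x" * 20
assert sys.getsizeof(s) == overhead + 20   # 49+N for a pure-ASCII str

# dtype='U' costs 4N bytes, so the str overhead wins only past ~17 chars
# and reaches a 50% saving around 49 chars:
assert overhead + 17 < 4 * 17
assert overhead + 49 <= 2 * 49
```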

Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc. so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
buffer).


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris  wrote:

>
>
>
> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith  wrote:
>
>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>>  wrote:
>> >
>> >
>> >
>> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <
>> oscar.j.benja...@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" > >
>> >> wrote:
>> >> >
>> >> > I think we may want something like PEP 393. The S datatype may be the
>> >> > wrong place to look, we might want a modification of U instead so as
>> to
>> >> > transparently get the benefit of python strings.
>> >>
>> >> The approach taken in PEP 393 (the FSR) makes more sense for str than
>> it
>> >> does for numpy arrays for two reasons: str is immutable and opaque.
>> >>
>> >> Since str is immutable the maximum code point in the string can be
>> >> determined once when the string is created before anything else can
>> get a
>> >> pointer to the string buffer.
>> >>
>> >> Since it is opaque no one can rightly expect it to expose a particular
>> >> binary format so it is free to choose without compromising any expected
>> >> semantics.
>> >>
>> >> If someone can call buffer on an array then the FSR is a semantic
>> change.
>> >>
>> >> If a numpy 'U' array used the FSR and consisted only of ASCII
>> characters
>> >> then it would have a one byte per char buffer. What then happens if
>> you put
>> >> a higher code point in? The buffer needs to be resized and the data
>> copied
>> >> over. But then what happens to any buffer objects or array views? They
>> would
>> >> be pointing at the old buffer from before the resize. Subsequent
>> >> modifications to the resized array would not show up in other views
>> and vice
>> >> versa.
>> >>
>> >> I don't think that this can be done transparently since users of a
>> numpy
>> >> array need to know about the binary representation. That's why I
>> suggest a
>> >> dtype that has an encoding. Only in that way can it consistently have
>> both a
>> >> binary and a text interface.
>> >
>> >
>> > I didn't say we should change the S type, but that we should have
>> something,
>> > say 's', that appeared to python as a string. I think if we want
>> transparent
>> > string interoperability with python together with a compressed
>> > representation, and I think we need both, we are going to have to deal
>> with
>> > the difficulties of utf-8. That means raising errors if the string
>> doesn't
>> > fit in the allotted size, etc. Mind, this is a workaround for the mass
>> of
>> > ascii data that is already out there, not a substitute for 'U'.
>>
>> If we're going to be taking that much trouble, I'd suggest going ahead
>> and adding a variable-length string type (where the array itself
>> contains a pointer to a lookaside buffer, maybe with an optimization
>> for stashing short strings directly). The fixed-length requirement is
>> pretty onerous for lots of applications (e.g., pandas always uses
>> dtype="O" for strings -- and that might be a good workaround for some
>> people in this thread for now). The use of a lookaside buffer would
>> also make it practical to resize the buffer when the maximum code
>> point changed, for that matter...
>>
>
The more I think about it, the more I think we may need to do that. Note
that dynd has ragged arrays and I think they are implemented as pointers to
buffers. The easy way for us to do that would be a specialization of object
arrays to string types only as you suggest.



Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith  wrote:

> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
>  wrote:
> >
> >
> >
> > On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <
> oscar.j.benja...@gmail.com>
> > wrote:
> >>
> >>
> >> On Jan 20, 2014 8:35 PM, "Charles R Harris" 
> >> wrote:
> >> >
> >> > I think we may want something like PEP 393. The S datatype may be the
> >> > wrong place to look, we might want a modification of U instead so as
> to
> >> > transparently get the benefit of python strings.
> >>
> >> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> >> does for numpy arrays for two reasons: str is immutable and opaque.
> >>
> >> Since str is immutable the maximum code point in the string can be
> >> determined once when the string is created before anything else can get
> a
> >> pointer to the string buffer.
> >>
> >> Since it is opaque no one can rightly expect it to expose a particular
> >> binary format so it is free to choose without compromising any expected
> >> semantics.
> >>
> >> If someone can call buffer on an array then the FSR is a semantic
> change.
> >>
> >> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> >> then it would have a one byte per char buffer. What then happens if you
> put
> >> a higher code point in? The buffer needs to be resized and the data
> copied
> >> over. But then what happens to any buffer objects or array views? They
> would
> >> be pointing at the old buffer from before the resize. Subsequent
> >> modifications to the resized array would not show up in other views and
> vice
> >> versa.
> >>
> >> I don't think that this can be done transparently since users of a numpy
> >> array need to know about the binary representation. That's why I
> suggest a
> >> dtype that has an encoding. Only in that way can it consistently have
> both a
> >> binary and a text interface.
> >
> >
> > I didn't say we should change the S type, but that we should have
> something,
> > say 's', that appeared to python as a string. I think if we want
> transparent
> > string interoperability with python together with a compressed
> > representation, and I think we need both, we are going to have to deal
> with
> > the difficulties of utf-8. That means raising errors if the string
> doesn't
> > fit in the allotted size, etc. Mind, this is a workaround for the mass of
> > ascii data that is already out there, not a substitute for 'U'.
>
> If we're going to be taking that much trouble, I'd suggest going ahead
> and adding a variable-length string type (where the array itself
> contains a pointer to a lookaside buffer, maybe with an optimization
> for stashing short strings directly). The fixed-length requirement is
> pretty onerous for lots of applications (e.g., pandas always uses
> dtype="O" for strings -- and that might be a good workaround for some
> people in this thread for now). The use of a lookaside buffer would
> also make it practical to resize the buffer when the maximum code
> point changed, for that matter...
>
> Though, IMO any new dtype here would need a cleanup of the dtype code
> first so that it doesn't require yet more massive special cases all
> over umath.so.
>

Worth thinking about. As another alternative, what is the minimum we need
to make a restricted encoding, say latin-1, appear transparently as a
unicode string to python? I know the python folks don't like this much, but
I suspect something along that line will eventually be required for the
http folks.
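[Part of what makes latin-1 the easy case here: it maps the 256 byte values one-to-one onto the first 256 Unicode code points, so a one-byte buffer round-trips losslessly; a pure-Python illustration:]

```python
# Every byte value decodes, positionally, to the code point of the same value.
every_byte = bytes(range(256))
s = every_byte.decode("latin-1")

assert len(s) == 256
assert all(ord(c) == b for c, b in zip(s, every_byte))
assert s.encode("latin-1") == every_byte   # exact round-trip
```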

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Nathaniel Smith
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris
 wrote:
>
>
>
> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin 
> wrote:
>>
>>
>> On Jan 20, 2014 8:35 PM, "Charles R Harris" 
>> wrote:
>> >
>> > I think we may want something like PEP 393. The S datatype may be the
>> > wrong place to look, we might want a modification of U instead so as to
>> > transparently get the benefit of python strings.
>>
>> The approach taken in PEP 393 (the FSR) makes more sense for str than it
>> does for numpy arrays for two reasons: str is immutable and opaque.
>>
>> Since str is immutable the maximum code point in the string can be
>> determined once when the string is created before anything else can get a
>> pointer to the string buffer.
>>
>> Since it is opaque no one can rightly expect it to expose a particular
>> binary format so it is free to choose without compromising any expected
>> semantics.
>>
>> If someone can call buffer on an array then the FSR is a semantic change.
>>
>> If a numpy 'U' array used the FSR and consisted only of ASCII characters
>> then it would have a one byte per char buffer. What then happens if you put
>> a higher code point in? The buffer needs to be resized and the data copied
>> over. But then what happens to any buffer objects or array views? They would
>> be pointing at the old buffer from before the resize. Subsequent
>> modifications to the resized array would not show up in other views and vice
>> versa.
>>
>> I don't think that this can be done transparently since users of a numpy
>> array need to know about the binary representation. That's why I suggest a
>> dtype that has an encoding. Only in that way can it consistently have both a
>> binary and a text interface.
>
>
> I didn't say we should change the S type, but that we should have something,
> say 's', that appeared to python as a string. I think if we want transparent
> string interoperability with python together with a compressed
> representation, and I think we need both, we are going to have to deal with
> the difficulties of utf-8. That means raising errors if the string doesn't
> fit in the allotted size, etc. Mind, this is a workaround for the mass of
> ascii data that is already out there, not a substitute for 'U'.

If we're going to be taking that much trouble, I'd suggest going ahead
and adding a variable-length string type (where the array itself
contains a pointer to a lookaside buffer, maybe with an optimization
for stashing short strings directly). The fixed-length requirement is
pretty onerous for lots of applications (e.g., pandas always uses
dtype="O" for strings -- and that might be a good workaround for some
people in this thread for now). The use of a lookaside buffer would
also make it practical to resize the buffer when the maximum code
point changed, for that matter...

Though, IMO any new dtype here would need a cleanup of the dtype code
first so that it doesn't require yet more massive special cases all
over umath.so.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin
wrote:

>
> On Jan 20, 2014 8:35 PM, "Charles R Harris" 
> wrote:
> >
> > I think we may want something like PEP 393. The S datatype may be the
> wrong place to look, we might want a modification of U instead so as to
> transparently get the benefit of python strings.
>
> The approach taken in PEP 393 (the FSR) makes more sense for str than it
> does for numpy arrays for two reasons: str is immutable and opaque.
>
> Since str is immutable the maximum code point in the string can be
> determined once when the string is created before anything else can get a
> pointer to the string buffer.
>
> Since it is opaque no one can rightly expect it to expose a particular
> binary format so it is free to choose without compromising any expected
> semantics.
>
> If someone can call buffer on an array then the FSR is a semantic change.
>
> If a numpy 'U' array used the FSR and consisted only of ASCII characters
> then it would have a one byte per char buffer. What then happens if you put
> a higher code point in? The buffer needs to be resized and the data copied
> over. But then what happens to any buffer objects or array views? They
> would be pointing at the old buffer from before the resize. Subsequent
> modifications to the resized array would not show up in other views and
> vice versa.
>
> I don't think that this can be done transparently since users of a numpy
> array need to know about the binary representation. That's why I suggest a
> dtype that has an encoding. Only in that way can it consistently have both
> a binary and a text interface.
>

I didn't say we should change the S type, but that we should have
something, say 's', that appeared to python as a string. I think if we want
transparent string interoperability with python together with a compressed
representation, and I think we need both, we are going to have to deal with
the difficulties of utf-8. That means raising errors if the string doesn't
fit in the allotted size, etc. Mind, this is a workaround for the mass of
ascii data that is already out there, not a substitute for 'U'.

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Oscar Benjamin
On Jan 20, 2014 8:35 PM, "Charles R Harris" 
wrote:
>
> I think we may want something like PEP 393. The S datatype may be the
wrong place to look, we might want a modification of U instead so as to
transparently get the benefit of python strings.

The approach taken in PEP 393 (the FSR) makes more sense for str than it
does for numpy arrays for two reasons: str is immutable and opaque.

Since str is immutable the maximum code point in the string can be
determined once when the string is created before anything else can get a
pointer to the string buffer.

Since it is opaque no one can rightly expect it to expose a particular
binary format so it is free to choose without compromising any expected
semantics.

If someone can call buffer on an array then the FSR is a semantic change.

If a numpy 'U' array used the FSR and consisted only of ASCII characters
then it would have a one byte per char buffer. What then happens if you put
a higher code point in? The buffer needs to be resized and the data copied
over. But then what happens to any buffer objects or array views? They
would be pointing at the old buffer from before the resize. Subsequent
modifications to the resized array would not show up in other views and
vice versa.

I don't think that this can be done transparently since users of a numpy
array need to know about the binary representation. That's why I suggest a
dtype that has an encoding. Only in that way can it consistently have both
a binary and a text interface.
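[The FSR's per-string width selection is observable from Python via sys.getsizeof; a CPython 3.3+ implementation detail, so exact numbers vary by build:]

```python
import sys

ascii_s  = "a" * 10           # max code point < 128   -> 1 byte/char
latin_s  = "\xe9" * 10        # < 256                  -> 1 byte/char
bmp_s    = "\u20ac" * 10      # < 65536                -> 2 bytes/char
astral_s = "\U0001d11e" * 10  # otherwise              -> 4 bytes/char

sizes = [sys.getsizeof(x) for x in (ascii_s, latin_s, bmp_s, astral_s)]
assert sizes == sorted(sizes)    # size grows with the max code point
assert sizes[3] - sizes[0] > 20  # astral strings cost 4x per character
```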

Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 11:40 AM, Oscar Benjamin  wrote:

>
> On Jan 20, 2014 5:21 PM, "Charles R Harris" 
> wrote:
> > On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <
> aldcr...@head.cfa.harvard.edu> wrote:
> >> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <
> oscar.j.benja...@gmail.com> wrote:
> >>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> >>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> >>>
> >>> And why are you needing to write .decode('ascii') everywhere?
> >>
> >> >>> print("The first value is {}".format(bytestring_array[0]))
> >>
> >> On Python 2 this gives "The first value is string_value", while on
> Python 3 this gives "The first value is b'string_value'".
> >
> > As Nathaniel has mentioned, this is a known problem with Python 3 and
> the developers are trying to come up with a solution. Python 3.4 solves
> some existing problems, but this one remains. It's not just numpy here,
> it's that python itself needs to provide some help.
>
> If you think that anything in core Python will change so that you can mix
> text and bytes as above then I think you are very much mistaken. If you're
> referring to PEP 460/461 then you have misunderstood the purpose of those
> PEPs. The authors and reviewers will carefully ensure that nothing changes
> to make the above work the way that it did in 2.x.
>

I think we may want something like PEP 393.
The S datatype may be the wrong place to look, we might want a modification
of U instead so as to transparently get the benefit of python strings.

Chuck

>
>


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread josef . pktd
On Mon, Jan 20, 2014 at 12:12 PM, Aldcroft, Thomas
 wrote:
>
>
>
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin
>  wrote:
>>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>> > wrote:
>> > > How significant are the performance issues? Does anyone really use
>> > > numpy
>> > > for
>> > > this kind of text handling? If you really are operating on gigantic
>> > > text
>> > > arrays of ascii characters then is it so bad to just use the bytes
>> > > dtype
>> > > and
>> > > handle decoding/encoding at the boundaries? If you're not operating on
>> > > gigantic text arrays is there really a noticeable problem just using
>> > > the
>> > > 'U'
>> > > dtype?
>> > >
>> >
>> > I use numpy for giga-row arrays of short text strings, so memory and
>> > performance issues are real.
>> >
>> > As discussed in the previous parent thread, using the bytes dtype is
>> > really
>> > a problem because users of a text array want to do things like filtering
>> > (`match_rows = text_array == 'match'`), printing, or other manipulations
>> > in
>> > a natural way without having to continually use bytestring literals or
>> > `.decode('ascii')` everywhere.  I tried converting a few packages while
>> > leaving the arrays as bytestrings and it just ended up as a very big
>> > mess.
>> >
>> > From my perspective the goal here is to provide a pragmatic way to allow
>> > numpy-based applications and end users to use python 3.  Something like
>> > this proposal seems to be the right direction, maybe not pure and
>> > perfect
>> > but a sensible step to get us there given the reality of scientific
>> > computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big a
>> deal.
>
>
> It's a big deal because all your existing python 2 code suddenly breaks on
> python 3, even after running 2to3.  Yes, you can backfix all the python 2
> code and use bytestring literals everywhere, but that is very painful and
> ugly.  More importantly it's very fiddly because *sometimes* you'll need to
> use bytestring literals, and *sometimes* not, depending on the exact dataset
> you've been handed.  That's basically a non-starter.
>
> As you say below, the only solution is a proper separation of bytes/unicode
> where everything internally is unicode.  The problem is that the existing
> 4-byte unicode in numpy is a big performance / memory hit.  It's even
> trickier because libraries will happily deliver a numpy structured array
> with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to
> then convert to 'U' since you need to remake the entire structured array.
> With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
>
>>
>> And why are you needing to write .decode('ascii') everywhere?
>
>
> >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on Python 3
> this gives "The first value is b'string_value'".

Unfortunately (?) set_printoptions and set_string_function don't work
with numpy scalars AFAICS. If they did then it would be possible to
override the string representation. It works for arrays.

I didn't find the right key for numpy.bytes_ on python 3.3, so now my
interpreter can only print bytes:
np.set_printoptions(formatter={'all':lambda x:
x.decode('ascii',errors="ignore") })

Josef
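[For arrays, the override can be scoped to string elements with the 'numpystr' formatter key rather than clobbering 'all'; a sketch:]

```python
import numpy as np

a = np.array([b"alpha", b"beta"], dtype="S5")
# 'numpystr' applies to numpy.str_ and numpy.bytes_ elements only.
s = np.array2string(
    a, formatter={"numpystr": lambda x: x.decode("ascii", errors="ignore")}
)
assert "alpha" in s and "b'" not in s   # renders without the b'' prefix
```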

>
>>
>> If you really
>> do just want to work with bytes in your own known encoding then why not
>> just
>> read and write in binary mode?
>>
>> I apologise if I'm wrong but I suspect that much of the difficulty in
>> getting
>> the bytes/unicode separation right is down to the fact that a lot of the
>> code
>> you're using (or attempting to support) hasn't yet been ported to a clean
>> text
>> model. When I started using Python 3 it took me quite a few failed
>> attempts
>> at understanding the text model before I got to the point where I
>> understood
>> how it is supposed to be used. The problem was that I had been conflating
>> text
>> and bytes in many places, and that's hard to disentangle. Having fixed
>> most of
>> those problems I now understand why it is such an improvement.
>>
>> In any case I don't see anything wrong with a more efficient dtype for
>> representing text if the user can specify the encoding. The problem is
>> that
>> numpy arrays expose their underlying memory buffer. Allowing them to
>> interact
>> directly with text strings on the one side and binary files on the other
>> breaches Python 3's very good text model unless the user can specify the
>> encoding that is to be used. Or at least if there is to be a blessed
>> encoding
>> then make it unicode-capable utf-8 instead of legacy ascii/latin-1.
>>
>>
>> Oscar
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
>
> ___

Re: [Numpy-discussion] A one-byte string dtype? (Charles R Harris)

2014-01-20 Thread David Goldsmith
On Mon, Jan 20, 2014 at 9:11 AM,  wrote:

> I think that is right. Not having an effective way to handle these common
> scientific data sets will block acceptance of Python 3. But we do need to
> figure out the best way to add this functionality.
>
> Chuck
>

Sounds like it might be time for some formal data collection, e.g., a
wiki-poll of users' use-cases.  (I know this wouldn't be exhaustive, but at
least it will provide guidance and a "checklist" of situations we should be
sure our solution covers.)

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Oscar Benjamin
On Jan 20, 2014 5:21 PM, "Charles R Harris" 
wrote:
> On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:
>> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <
oscar.j.benja...@gmail.com> wrote:
>>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>>>
>>> And why are you needing to write .decode('ascii') everywhere?
>>
>> >>> print("The first value is {}".format(bytestring_array[0]))
>>
>> On Python 2 this gives "The first value is string_value", while on
Python 3 this gives "The first value is b'string_value'".
>
> As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.

If you think that anything in core Python will change so that you can mix
text and bytes as above then I think you are very much mistaken. If you're
referring to PEP 460/461 then you have misunderstood the purpose of those
PEPs. The authors and reviewers will carefully ensure that nothing changes
to make the above work the way that it did in 2.x.

Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 10:12 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin <
> oscar.j.benja...@gmail.com> wrote:
>
>> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
>> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
>> > wrote:
>> > > How significant are the performance issues? Does anyone really use
>> numpy
>> > > for
>> > > this kind of text handling? If you really are operating on gigantic
>> text
>> > > arrays of ascii characters then is it so bad to just use the bytes
>> dtype
>> > > and
>> > > handle decoding/encoding at the boundaries? If you're not operating on
>> > > gigantic text arrays is there really a noticeable problem just using
>> the
>> > > 'U'
>> > > dtype?
>> > >
>> >
>> > I use numpy for giga-row arrays of short text strings, so memory and
>> > performance issues are real.
>> >
>> > As discussed in the previous parent thread, using the bytes dtype is
>> really
>> > a problem because users of a text array want to do things like filtering
>> > (`match_rows = text_array == 'match'`), printing, or other
>> manipulations in
>> > a natural way without having to continually use bytestring literals or
>> > `.decode('ascii')` everywhere.  I tried converting a few packages while
>> > leaving the arrays as bytestrings and it just ended up as a very big
>> mess.
>> >
>> > From my perspective the goal here is to provide a pragmatic way to allow
>> > numpy-based applications and end users to use python 3.  Something like
>> > this proposal seems to be the right direction, maybe not pure and
>> perfect
>> > but a sensible step to get us there given the reality of scientific
>> > computing.
>>
>> I don't really see how writing b'match' instead of 'match' is that big a
>> deal.
>>
>
> It's a big deal because all your existing python 2 code suddenly breaks on
> python 3, even after running 2to3.  Yes, you can backfix all the python 2
> code and use bytestring literals everywhere, but that is very painful and
> ugly.  More importantly it's very fiddly because *sometimes* you'll need to
> use bytestring literals, and *sometimes* not, depending on the exact
> dataset you've been handed.  That's basically a non-starter.
>
> As you say below, the only solution is a proper separation of
> bytes/unicode where everything internally is unicode.  The problem is that
> the existing 4-byte unicode in numpy is a big performance / memory hit.
>  It's even trickier because libraries will happily deliver a numpy
> structured array with an 'S'-dtype field (from a binary dataset on disk),
> and it's a pain to then convert to 'U' since you need to remake the entire
> structured array.  With a one-byte unicode the goal would be an in-place
> update of 'S' to 's'.
>
>
>> And why are you needing to write .decode('ascii') everywhere?
>
>
> >>> print("The first value is {}".format(bytestring_array[0]))
>
> On Python 2 this gives "The first value is string_value", while on Python
> 3 this gives "The first value is b'string_value'".
>

As Nathaniel has mentioned, this is a known problem with Python 3 and the
developers are trying to come up with a solution. Python 3.4 solves some
existing problems, but this one remains. It's not just numpy here, it's
that python itself needs to provide some help.

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Charles R Harris
On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas <
aldcr...@head.cfa.harvard.edu> wrote:

>
>
>
> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <
> oscar.j.benja...@gmail.com> wrote:
>
>> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
>> > Folks,
>> >
>> > I've been blathering away on the related threads a lot -- sorry if it's
>> too
>> > much. It's gotten a bit tangled up, so I thought I'd start a new one to
>> > address this one question (i.e. dont bring up genfromtext here):
>> >
>> > Would it be a good thing for numpy to have a one-byte--per-character
>> string
>> > type?
>>
>> If you mean a string type that can only hold latin-1 characters then I
>> think
>> that this is a step backwards.
>>
>> If you mean a dtype that holds bytes in a known, specifiable encoding and
>> automatically decodes them to unicode strings when you call .item() and
>> has a
>> friendly repr() then that may be a good idea.
>>
>> So for example you could have dtype='S:utf-8' which would store strings
>> encoded as utf-8 e.g.:
>>
>> >>> text = array(['foo', 'bar'], dtype='S:utf-8')
>> >>> text
>> array(['foo', 'bar'], dtype='|S3:utf-8')
>> >>> print(text)
>> ['foo', 'bar']
>> >>> text[0]
>> 'foo'
>> >>> text.nbytes
>> 6
>>
>> > We did have that with the 'S' type in py2, but the changes in py3 have
>> made
>> > it not quite the right thing. And it appears that enough people use 'S'
>> in
>> > py3 to mean 'bytes', so that we can't change that now.
>>
>> It wasn't really the right thing before either. That's why Python 3 has
>> changed all of this.
>>
>> > The only difference may be that 'S' currently auto translates to a bytes
>> > object, resulting in things like:
>> >
>> > np.array(['some text',],  dtype='S')[0] == 'some text'
>> >
>> > yielding False on Py3. And you can't do all the usual text stuff with
>> the
>> > resulting bytes object, either. (and it probably used the default
>> encoding
>> > to generate the bytes, so will barf on some inputs, though that may be
>> > unavoidable.) So you need to decode the bytes that are given back, and
>> now
>> > that I think about it, I have no idea what encoding you'd need to use in
>> > the general case.
>>
>> You should let the user specify the encoding or otherwise require them to
>> use
>> the 'U' dtype.
>>
>> > So the correct solution is (particularly on py3) to use the 'U'
>> (unicode)
>> > dtype for text in numpy arrays.
>>
>> Absolutely. Embrace the Python 3 text model. Once you understand the how,
>> what
>> and why of it you'll see that it really is a good thing!
>>
>> > However, the 'U' dtype is 4 bytes per character, and that may be "too
>> big"
>> > for some use-cases. And there is a lot of text in scientific data sets
>> that
>> > are pure ascii, or at least some 1-byte-per-character encoding.
>> >
>> > So, in the spirit of having multiple numeric types that use different
>> > amounts of memory, and can hold different ranges of values, a
>> one-byte-per
>> > character dtype would be nice:
>> >
>> > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I
>> personally
>> > don't think that's worth it, but maybe that's because I'm an english
>> > speaker...)
>>
>> You could just use a 2-byte encoding with the S dtype e.g.
>> dtype='S:utf-16-le'.
>>
>> > It could use the 's' (lower-case s) type identifier.
>> >
>> > For passing to/from python built-in objects, it would
>> >
>> > * Allow either Python bytes objects or Python unicode objects as input
>> >  a) bytes objects would be passed through as-is
>> >  b) unicode objects would be encoded as latin-1
>> >
>> > [note: I'm not entirely sure that bytes objects should be allowed, but
>> it
>> would provide a nice efficiency in a fairly common case]
>>
>> I think it would be a bad idea to accept bytes here. There are good
>> reasons
>> that Python 3 creates a barrier between the two worlds of text and bytes.
>> Allowing implicit mixing of bytes and text is a recipe for mojibake. The
>> TypeErrors in Python 3 are used to guard against conceptual errors that
>> lead
>> to data corruption. Attempting to undermine that barrier in numpy would
>> be a
>> backward step.
>>
>> I apologise if this is misplaced but there seems to be an attitude that
>> scientific programming isn't really affected by the issues that have led
>> to
>> the Python 3 text model. I think that's ridiculous; data corruption is a
>> problem in scientific programming just as it is anywhere else.
>>
>> > * It would create python unicode text objects, decoded as latin-1.
>>
>> Don't try to bless a particular encoding and stop trying to pretend that
>> it's
>> possible to write a sensible system where end users don't need to worry
>> about
>> and specify the encoding of their data.
>>
>> > Could we have a way to specify another encoding? I'm not sure how that
>> > would fit into the dtype system.
>>
>> If the encoding cannot be specified then the whole idea is misguided.
>>
>> > I've explained the latin-1 thing on other threads, but the 

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Aldcroft, Thomas
On Mon, Jan 20, 2014 at 10:40 AM, Oscar Benjamin  wrote:

> On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> > On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> > wrote:
> > > How significant are the performance issues? Does anyone really use
> numpy
> > > for
> > > this kind of text handling? If you really are operating on gigantic
> text
> > > arrays of ascii characters then is it so bad to just use the bytes
> dtype
> > > and
> > > handle decoding/encoding at the boundaries? If you're not operating on
> > > gigantic text arrays is there really a noticeable problem just using
> the
> > > 'U'
> > > dtype?
> > >
> >
> > I use numpy for giga-row arrays of short text strings, so memory and
> > performance issues are real.
> >
> > As discussed in the previous parent thread, using the bytes dtype is
> really
> > a problem because users of a text array want to do things like filtering
> > (`match_rows = text_array == 'match'`), printing, or other manipulations
> in
> > a natural way without having to continually use bytestring literals or
> > `.decode('ascii')` everywhere.  I tried converting a few packages while
> > leaving the arrays as bytestrings and it just ended up as a very big
> mess.
> >
> > From my perspective the goal here is to provide a pragmatic way to allow
> > numpy-based applications and end users to use python 3.  Something like
> > this proposal seems to be the right direction, maybe not pure and perfect
> > but a sensible step to get us there given the reality of scientific
> > computing.
>
> I don't really see how writing b'match' instead of 'match' is that big a
> deal.
>

It's a big deal because all your existing python 2 code suddenly breaks on
python 3, even after running 2to3.  Yes, you can backfix all the python 2
code and use bytestring literals everywhere, but that is very painful and
ugly.  More importantly it's very fiddly because *sometimes* you'll need to
use bytestring literals, and *sometimes* not, depending on the exact
dataset you've been handed.  That's basically a non-starter.

As you say below, the only solution is a proper separation of bytes/unicode
where everything internally is unicode.  The problem is that the existing
4-byte unicode in numpy is a big performance / memory hit.  It's even
trickier because libraries will happily deliver a numpy structured array
with an 'S'-dtype field (from a binary dataset on disk), and it's a pain to
then convert to 'U' since you need to remake the entire structured array.
 With a one-byte unicode the goal would be an in-place update of 'S' to 's'.
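The structured-array conversion pain described above can be sketched with current numpy; the field names and values below are invented for illustration:

```python
import numpy as np

# A structured array as a binary reader might deliver it: a bytes 'S' field.
dat = np.array([(1, b'SRC1'), (2, b'SRC2')],
               dtype=[('id', 'i4'), ('name', 'S4')])

# 'S4' is 4 bytes per item but 'U4' is 16, so there is no in-place route:
# the whole structured array has to be rebuilt field by field.
converted = np.empty(dat.shape, dtype=[('id', 'i4'), ('name', 'U4')])
converted['id'] = dat['id']
converted['name'] = np.char.decode(dat['name'], 'ascii')

assert converted['name'][0] == 'SRC1'
```

A one-byte 's' dtype with the same itemsize as 'S' is what would turn this rebuild into a cheap in-place relabelling.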


> And why are you needing to write .decode('ascii') everywhere?


>>> print("The first value is {}".format(bytestring_array[0]))

On Python 2 this gives "The first value is string_value", while on Python 3
this gives "The first value is b'string_value'".
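The mismatch is easy to reproduce; a minimal sketch (array contents invented for illustration):

```python
import numpy as np

names = np.array([b'alpha', b'beta'], dtype='S5')

# On Python 3 the bytes repr leaks into formatted text...
assert "The first value is {}".format(names[0]) == "The first value is b'alpha'"

# ...so every user-facing site needs an explicit decode:
assert ("The first value is {}".format(names[0].decode('ascii'))
        == "The first value is alpha")

# Comparisons have the same asymmetry: bytes never equal str on Python 3.
assert names[0] == b'alpha'
assert not (names[0] == 'alpha')
```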


> If you really
> do just want to work with bytes in your own known encoding then why not
> just
> read and write in binary mode?
>
> I apologise if I'm wrong but I suspect that much of the difficulty in
> getting
> the bytes/unicode separation right is down to the fact that a lot of the
> code
> you're using (or attempting to support) hasn't yet been ported to a clean
> text
> model. When I started using Python 3 it took me quite a few failed attempts
> at understanding the text model before I got to the point where I
> understood
> how it is supposed to be used. The problem was that I had been conflating
> text
> and bytes in many places, and that's hard to disentangle. Having fixed
> most of
> those problems I now understand why it is such an improvement.
>
> In any case I don't see anything wrong with a more efficient dtype for
> representing text if the user can specify the encoding. The problem is that
> numpy arrays expose their underlying memory buffer. Allowing them to
> interact
> directly with text strings on the one side and binary files on the other
> breaches Python 3's very good text model unless the user can specify the
> encoding that is to be used. Or at least if there is to be a blessed
> encoding
> then make it unicode-capable utf-8 instead of legacy ascii/latin-1.
>
>
> Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Oscar Benjamin
On Mon, Jan 20, 2014 at 10:00:55AM -0500, Aldcroft, Thomas wrote:
> On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
> wrote:
> > How significant are the performance issues? Does anyone really use numpy
> > for
> > this kind of text handling? If you really are operating on gigantic text
> > arrays of ascii characters then is it so bad to just use the bytes dtype
> > and
> > handle decoding/encoding at the boundaries? If you're not operating on
> > gigantic text arrays is there really a noticeable problem just using the
> > 'U'
> > dtype?
> >
>
> I use numpy for giga-row arrays of short text strings, so memory and
> performance issues are real.
>
> As discussed in the previous parent thread, using the bytes dtype is really
> a problem because users of a text array want to do things like filtering
> (`match_rows = text_array == 'match'`), printing, or other manipulations in
> a natural way without having to continually use bytestring literals or
> `.decode('ascii')` everywhere.  I tried converting a few packages while
> leaving the arrays as bytestrings and it just ended up as a very big mess.
>
> From my perspective the goal here is to provide a pragmatic way to allow
> numpy-based applications and end users to use python 3.  Something like
> this proposal seems to be the right direction, maybe not pure and perfect
> but a sensible step to get us there given the reality of scientific
> computing.

I don't really see how writing b'match' instead of 'match' is that big a deal.
And why are you needing to write .decode('ascii') everywhere? If you really
do just want to work with bytes in your own known encoding then why not just
read and write in binary mode?
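For what it's worth, the "decode at the boundaries" pattern being argued for here looks something like this with today's numpy (the example arrays are invented):

```python
import numpy as np

# Keep the data as fixed-width bytes internally (e.g. straight off disk)...
raw = np.array([b'NGC1365', b'NGC4258'], dtype='S7')

# ...and decode once, at the boundary where real text is needed:
labels = np.char.decode(raw, 'ascii')        # '<U7' array for display code
assert 'Object: ' + labels[0] == 'Object: NGC1365'

# Encode again just before writing back out in binary mode.
assert np.char.encode(labels, 'ascii')[0] == b'NGC1365'
```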

I apologise if I'm wrong but I suspect that much of the difficulty in getting
the bytes/unicode separation right is down to the fact that a lot of the code
you're using (or attempting to support) hasn't yet been ported to a clean text
model. When I started using Python 3 it took me quite a few failed attempts
at understanding the text model before I got to the point where I understood
how it is supposed to be used. The problem was that I had been conflating text
and bytes in many places, and that's hard to disentangle. Having fixed most of
those problems I now understand why it is such an improvement.

In any case I don't see anything wrong with a more efficient dtype for
representing text if the user can specify the encoding. The problem is that
numpy arrays expose their underlying memory buffer. Allowing them to interact
directly with text strings on the one side and binary files on the other
breaches Python 3's very good text model unless the user can specify the
encoding that is to be used. Or at least if there is to be a blessed encoding
then make it unicode-capable utf-8 instead of legacy ascii/latin-1.


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Aldcroft, Thomas
On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin
wrote:

> On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
> > Folks,
> >
> > I've been blathering away on the related threads a lot -- sorry if it's
> too
> > much. It's gotten a bit tangled up, so I thought I'd start a new one to
> > address this one question (i.e. dont bring up genfromtext here):
> >
> > Would it be a good thing for numpy to have a one-byte--per-character
> string
> > type?
>
> If you mean a string type that can only hold latin-1 characters then I
> think
> that this is a step backwards.
>
> If you mean a dtype that holds bytes in a known, specifiable encoding and
> automatically decodes them to unicode strings when you call .item() and
> has a
> friendly repr() then that may be a good idea.
>
> So for example you could have dtype='S:utf-8' which would store strings
> encoded as utf-8 e.g.:
>
> >>> text = array(['foo', 'bar'], dtype='S:utf-8')
> >>> text
> array(['foo', 'bar'], dtype='|S3:utf-8')
> >>> print(text)
> ['foo', 'bar']
> >>> text[0]
> 'foo'
> >>> text.nbytes
> 6
>
> > We did have that with the 'S' type in py2, but the changes in py3 have
> made
> > it not quite the right thing. And it appears that enough people use 'S'
> in
> > py3 to mean 'bytes', so that we can't change that now.
>
> It wasn't really the right thing before either. That's why Python 3 has
> changed all of this.
>
> > The only difference may be that 'S' currently auto translates to a bytes
> > object, resulting in things like:
> >
> > np.array(['some text',],  dtype='S')[0] == 'some text'
> >
> > yielding False on Py3. And you can't do all the usual text stuff with the
> > resulting bytes object, either. (and it probably used the default
> encoding
> > to generate the bytes, so will barf on some inputs, though that may be
> > unavoidable.) So you need to decode the bytes that are given back, and
> now
> > that I think about it, I have no idea what encoding you'd need to use in
> > the general case.
>
> You should let the user specify the encoding or otherwise require them to
> use
> the 'U' dtype.
>
> > So the correct solution is (particularly on py3) to use the 'U' (unicode)
> > dtype for text in numpy arrays.
>
> Absolutely. Embrace the Python 3 text model. Once you understand the how,
> what
> and why of it you'll see that it really is a good thing!
>
> > However, the 'U' dtype is 4 bytes per character, and that may be "too
> big"
> > for some use-cases. And there is a lot of text in scientific data sets
> that
> > are pure ascii, or at least some 1-byte-per-character encoding.
> >
> > So, in the spirit of having multiple numeric types that use different
> > amounts of memory, and can hold different ranges of values, a
> one-byte-per
> > character dtype would be nice:
> >
> > (note, this opens the door for a 2-byte per (UCS-2) dtype too, I
> personally
> > don't think that's worth it, but maybe that's because I'm an english
> > speaker...)
>
> You could just use a 2-byte encoding with the S dtype e.g.
> dtype='S:utf-16-le'.
>
> > It could use the 's' (lower-case s) type identifier.
> >
> > For passing to/from python built-in objects, it would
> >
> > * Allow either Python bytes objects or Python unicode objects as input
> >  a) bytes objects would be passed through as-is
> >  b) unicode objects would be encoded as latin-1
> >
> > [note: I'm not entirely sure that bytes objects should be allowed, but it
> > would provide a nice efficiency in a fairly common case]
>
> I think it would be a bad idea to accept bytes here. There are good reasons
> that Python 3 creates a barrier between the two worlds of text and bytes.
> Allowing implicit mixing of bytes and text is a recipe for mojibake. The
> TypeErrors in Python 3 are used to guard against conceptual errors that
> lead
> to data corruption. Attempting to undermine that barrier in numpy would be
> a
> backward step.
>
> I apologise if this is misplaced but there seems to be an attitude that
> scientific programming isn't really affected by the issues that have led
> to
> the Python 3 text model. I think that's ridiculous; data corruption is a
> problem in scientific programming just as it is anywhere else.
>
> > * It would create python unicode text objects, decoded as latin-1.
>
> Don't try to bless a particular encoding and stop trying to pretend that
> it's
> possible to write a sensible system where end users don't need to worry
> about
> and specify the encoding of their data.
>
> > Could we have a way to specify another encoding? I'm not sure how that
> > would fit into the dtype system.
>
> If the encoding cannot be specified then the whole idea is misguided.
>
> > I've explained the latin-1 thing on other threads, but the short version
> is:
> >
> >  - It will work perfectly for ascii text
> >  - It will work perfectly for latin-1 text (natch)
> >  - It will never give you an UnicodeEncodeError regardless of what
> > arbitrary bytes you pass in.
>  - It will preserve those arbitrary bytes through an encoding/decoding operation.

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-20 Thread Oscar Benjamin
On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:
> Folks,
> 
> I've been blathering away on the related threads a lot -- sorry if it's too
> much. It's gotten a bit tangled up, so I thought I'd start a new one to
> address this one question (i.e. dont bring up genfromtext here):
> 
> Would it be a good thing for numpy to have a one-byte--per-character string
> type?

If you mean a string type that can only hold latin-1 characters then I think
that this is a step backwards.

If you mean a dtype that holds bytes in a known, specifiable encoding and
automatically decodes them to unicode strings when you call .item() and has a
friendly repr() then that may be a good idea.

So for example you could have dtype='S:utf-8' which would store strings
encoded as utf-8 e.g.:

>>> text = array(['foo', 'bar'], dtype='S:utf-8')
>>> text
array(['foo', 'bar'], dtype='|S3:utf-8')
>>> print(text)
['foo', 'bar']
>>> text[0]
'foo'
>>> text.nbytes
6
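The 'S:utf-8' dtype above is hypothetical, but the memory saving it targets can be seen today by keeping utf-8 bytes in an 'S' array and decoding on access:

```python
import numpy as np

text_u = np.array(['foo', 'bar'])          # '<U3': 4 bytes per character
text_s = np.char.encode(text_u, 'utf-8')   # 'S3': 1 byte per ascii character

assert text_u.nbytes == 24
assert text_s.nbytes == 6
assert text_s[0].decode('utf-8') == 'foo'
```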

> We did have that with the 'S' type in py2, but the changes in py3 have made
> it not quite the right thing. And it appears that enough people use 'S' in
> py3 to mean 'bytes', so that we can't change that now.

It wasn't really the right thing before either. That's why Python 3 has
changed all of this.

> The only difference may be that 'S' currently auto translates to a bytes
> object, resulting in things like:
> 
> np.array(['some text',],  dtype='S')[0] == 'some text'
> 
> yielding False on Py3. And you can't do all the usual text stuff with the
> resulting bytes object, either. (and it probably used the default encoding
> to generate the bytes, so will barf on some inputs, though that may be
> unavoidable.) So you need to decode the bytes that are given back, and now
> that I think about it, I have no idea what encoding you'd need to use in
> the general case.

You should let the user specify the encoding or otherwise require them to use
the 'U' dtype.

> So the correct solution is (particularly on py3) to use the 'U' (unicode)
> dtype for text in numpy arrays.

Absolutely. Embrace the Python 3 text model. Once you understand the how, what
and why of it you'll see that it really is a good thing!

> However, the 'U' dtype is 4 bytes per character, and that may be "too big"
> for some use-cases. And there is a lot of text in scientific data sets that
> are pure ascii, or at least some 1-byte-per-character encoding.
> 
> So, in the spirit of having multiple numeric types that use different
> amounts of memory, and can hold different ranges of values, a one-byte-per
> character dtype would be nice:
> 
> (note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally
> don't think that's worth it, but maybe that's because I'm an english
> speaker...)

You could just use a 2-byte encoding with the S dtype e.g.
dtype='S:utf-16-le'.

> It could use the 's' (lower-case s) type identifier.
> 
> For passing to/from python built-in objects, it would
> 
> * Allow either Python bytes objects or Python unicode objects as input
>  a) bytes objects would be passed through as-is
>  b) unicode objects would be encoded as latin-1
>
> [note: I'm not entirely sure that bytes objects should be allowed, but it
> would provide a nice efficiency in a fairly common case]

I think it would be a bad idea to accept bytes here. There are good reasons
that Python 3 creates a barrier between the two worlds of text and bytes.
Allowing implicit mixing of bytes and text is a recipe for mojibake. The
TypeErrors in Python 3 are used to guard against conceptual errors that lead
to data corruption. Attempting to undermine that barrier in numpy would be a
backward step.

I apologise if this is misplaced but there seems to be an attitude that
scientific programming isn't really affected by the issues that have led to
the Python 3 text model. I think that's ridiculous; data corruption is a
problem in scientific programming just as it is anywhere else.

> * It would create python unicode text objects, decoded as latin-1.

Don't try to bless a particular encoding and stop trying to pretend that it's
possible to write a sensible system where end users don't need to worry about
and specify the encoding of their data.

> Could we have a way to specify another encoding? I'm not sure how that
> would fit into the dtype system.

If the encoding cannot be specified then the whole idea is misguided.

> I've explained the latin-1 thing on other threads, but the short version is:
> 
>  - It will work perfectly for ascii text
>  - It will work perfectly for latin-1 text (natch)
>  - It will never give you an UnicodeEncodeError regardless of what
> arbitrary bytes you pass in.
>  - It will preserve those arbitrary bytes through an encoding/decoding
> operation.

So what happens if I do:

>>> with open('myutf-8-file.txt', 'rb') as fin:
... text = numpy.fromfile(fin, dtype='s')
>>> text[0] # Decodes as latin-1 leading to mojibake.
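Both halves of this exchange can be checked in plain Python: latin-1 really does round-trip arbitrary bytes losslessly, and it really does silently mangle bytes that were actually utf-8:

```python
# Every byte value 0-255 maps to a latin-1 code point, so decoding can
# never fail and a decode/encode round trip preserves the data exactly.
data = bytes(range(256))
assert data.decode('latin-1').encode('latin-1') == data

# But utf-8 text decoded as latin-1 yields mojibake with no error raised:
utf8_bytes = 'café'.encode('utf-8')        # b'caf\xc3\xa9'
assert utf8_bytes.decode('latin-1') == 'cafÃ©'
```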

I would propose that it's better to be able to do:

>>> with open('myutf-8-

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-17 Thread Aldcroft, Thomas
On Fri, Jan 17, 2014 at 5:30 PM, Chris Barker  wrote:

> Folks,
>
> I've been blathering away on the related threads a lot -- sorry if it's
> too much. It's gotten a bit tangled up, so I thought I'd start a new one to
> address this one question (i.e. dont bring up genfromtext here):
>
> Would it be a good thing for numpy to have a one-byte--per-character
> string type?
>
> We did have that with the 'S' type in py2, but the changes in py3 have
> made it not quite the right thing. And it appears that enough people use
> 'S' in py3 to mean 'bytes', so that we can't change that now.
>
> The only difference may be that 'S' currently auto translates to a bytes
> object, resulting in things like:
>
> np.array(['some text',],  dtype='S')[0] == 'some text'
>
> yielding False on Py3. And you can't do all the usual text stuff with the
> resulting bytes object, either. (and it probably used the default encoding
> to generate the bytes, so will barf on some inputs, though that may be
> unavoidable.) So you need to decode the bytes that are given back, and now
> that I think about it, I have no idea what encoding you'd need to use in
> the general case.
>
> So the correct solution is (particularly on py3) to use the 'U' (unicode)
> dtype for text in numpy arrays.
>
> However, the 'U' dtype is 4 bytes per character, and that may be "too big"
> for some use-cases. And there is a lot of text in scientific data sets that
> are pure ascii, or at least some 1-byte-per-character encoding.
>
> So, in the spirit of having multiple numeric types that use different
> amounts of memory, and can hold different ranges of values, a one-byte-per
> character dtype would be nice:
>
> (note, this opens the door for a 2-byte per (UCS-2) dtype too, I
> personally don't think that's worth it, but maybe that's because I'm an
> english speaker...)
>

>  It could use the 's' (lower-case s) type identifier.
>
> For passing to/from python built-in objects, it would
>
> * Allow either Python bytes objects or Python unicode objects as input
>  a) bytes objects would be passed through as-is
>  b) unicode objects would be encoded as latin-1
>
> [note: I'm not entirely sure that bytes objects should be allowed, but it
> would provide a nice efficiency in a fairly common case]
>
> * It would create python unicode text objects, decoded as latin-1.
>
> Could we have a way to specify another encoding? I'm not sure how that
> would fit into the dtype system.
>
> I've explained the latin-1 thing on other threads, but the short version
> is:
>
>  - It will work perfectly for ascii text
>  - It will work perfectly for latin-1 text (natch)
>  - It will never give you an UnicodeEncodeError regardless of what
> arbitrary bytes you pass in.
>  - It will preserve those arbitrary bytes through an encoding/decoding
> operation.
>
> (it still wouldn't allow you to store arbitrary unicode -- but that's the
> limitation of one-byte per character...)
>
> So:
>
> Bad idea all around: shut up already!
>
> or
>
> Fine idea, but who's going to write the code? not me!
>
> or
>
> We really should do this.
>

As evident from what I said in the previous thread, YES, this should really
be done!

One important feature would be changing the dtype from 'S' to 's' without
any memory copies, so that conversion would be very cheap.  Maybe this
would essentially come for free with something like astype('s', copy=False).
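The proposed 's' dtype doesn't exist, but the zero-copy mechanics this suggestion relies on are already how numpy behaves when no data conversion is needed:

```python
import numpy as np

a = np.array([b'ab', b'cd'], dtype='S2')

# astype with copy=False hands back the same array when the dtype matches...
assert a.astype('S2', copy=False) is a

# ...and .view() reinterprets the buffer with no copy, which is all a
# cheap 'S2' -> 's2' relabelling would need (same itemsize, same memory).
v = a.view('S2')
assert v.base is a
```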

- Tom


>
> (of course, with the options of amending the above not-very-fleshed out
> proposal)
>
> -Chris
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>


[Numpy-discussion] A one-byte string dtype?

2014-01-17 Thread Chris Barker
Folks,

I've been blathering away on the related threads a lot -- sorry if it's too
much. It's gotten a bit tangled up, so I thought I'd start a new one to
address this one question (i.e. don't bring up genfromtxt here):

Would it be a good thing for numpy to have a one-byte-per-character string
type?

We did have that with the 'S' type in py2, but the changes in py3 have made
it not quite the right thing. And it appears that enough people use 'S' in
py3 to mean 'bytes', so that we can't change that now.

The only difference may be that 'S' currently auto-translates to a bytes
object, resulting in things like:

np.array(['some text'], dtype='S')[0] == 'some text'

yielding False on Py3. And you can't do all the usual text stuff with the
resulting bytes object, either. (and it probably used the default encoding
to generate the bytes, so will barf on some inputs, though that may be
unavoidable.) So you need to decode the bytes that are given back, and now
that I think about it, I have no idea what encoding you'd need to use in
the general case.
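That mismatch is easy to demonstrate with today's 'S' dtype on Python 3:

```python
import numpy as np

# Indexing an 'S' array on Python 3 yields a bytes object, which never
# compares equal to a str -- hence the False above.
a = np.array(['some text'], dtype='S')
item = a[0]
assert isinstance(item, bytes)
assert item == b'some text'
assert item != 'some text'       # bytes vs. str: never equal on py3
assert item.decode('ascii') == 'some text'  # an explicit decode is needed
```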

So the correct solution is (particularly on py3) to use the 'U' (unicode)
dtype for text in numpy arrays.

However, the 'U' dtype is 4 bytes per character, and that may be "too big"
for some use-cases. And there is a lot of text in scientific data sets that
is pure ascii, or at least in some 1-byte-per-character encoding.
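The size difference is easy to check with itemsize ('U' is stored as UCS-4,
so it costs 4 bytes per character; 'S' stands in here for the proposed
one-byte type):

```python
import numpy as np

# 'U' stores 4 bytes per character (UCS-4); 'S' stores 1 byte per character.
u = np.array(['hello'], dtype='U5')
s = np.array([b'hello'], dtype='S5')
assert u.itemsize == 20   # 5 chars * 4 bytes
assert s.itemsize == 5    # 5 chars * 1 byte
```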

So, in the spirit of having multiple numeric types that use different
amounts of memory, and can hold different ranges of values, a
one-byte-per-character dtype would be nice:

(note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I
personally don't think that's worth it, but maybe that's because I'm an
English speaker...)

It could use the 's' (lower-case s) type identifier.

For passing to/from python built-in objects, it would

* Allow either Python bytes objects or Python unicode objects as input
 a) bytes objects would be passed through as-is
 b) unicode objects would be encoded as latin-1

[note: I'm not entirely sure that bytes objects should be allowed, but it
would provide a nice efficiency in a fairly common case]

* It would create python unicode text objects, decoded as latin-1.

Could we have a way to specify another encoding? I'm not sure how that
would fit into the dtype system.
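The coercion rules above could be sketched in plain Python (entirely
hypothetical -- the names `coerce_to_s` and `s_to_python` are illustrations,
not anything implemented in numpy):

```python
def coerce_to_s(value):
    """Hypothetical input coercion for the proposed 's' dtype."""
    if isinstance(value, bytes):
        return value                    # (a) pass bytes through as-is
    if isinstance(value, str):
        return value.encode('latin-1')  # (b) encode unicode as latin-1
    raise TypeError("expected bytes or str")

def s_to_python(stored):
    """Hypothetical output conversion: decode stored bytes as latin-1."""
    return stored.decode('latin-1')

assert coerce_to_s(b'abc') == b'abc'
assert coerce_to_s('café') == b'caf\xe9'
assert s_to_python(b'caf\xe9') == 'café'
```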

I've explained the latin-1 thing on other threads, but the short version is:

 - It will work perfectly for ascii text
 - It will work perfectly for latin-1 text (natch)
 - It will never give you a UnicodeEncodeError regardless of what
arbitrary bytes you pass in.
 - It will preserve those arbitrary bytes through an encoding/decoding
operation.

(it still wouldn't allow you to store arbitrary unicode -- but that's the
limitation of one-byte per character...)
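Those properties follow from latin-1 mapping every byte value 0-255 to a
distinct code point, so the round trip is lossless for arbitrary bytes:

```python
# latin-1 maps each of the 256 byte values to a distinct code point,
# so decoding never raises and encoding recovers the original bytes.
raw = bytes(range(256))
text = raw.decode('latin-1')           # never raises UnicodeDecodeError
assert text.encode('latin-1') == raw   # lossless round trip
assert len(text) == 256                # one character per byte
```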

So:

Bad idea all around: shut up already!

or

Fine idea, but who's going to write the code? not me!

or

We really should do this.

(of course, with the options of amending the above not-very-fleshed out
proposal)

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R        (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov