Re: [Numpy-discussion] Text array dtype for numpy

2014-01-25 Thread Oscar Benjamin
On 24 January 2014 22:43, Chris Barker chris.bar...@noaa.gov wrote:
> Oscar,
>
> Cool stuff, thanks!
>
> I'm wondering though what the use-case really is.

The use-case is precisely the use-case for dtype='S' on Py2 except
that it also works on Py3.

> The Py3 text model (actually the Py2 one, too) is quite clear that you
> want users to think of, and work with, text as text -- and not care how
> things are encoded in the underlying implementation. You only want the
> user to think about encodings on I/O -- transferring stuff between
> systems where you can't avoid it. And you might choose different
> encodings based on different needs.

Exactly. But what you're missing is that storing text in a numpy array
means putting the text into bytes, and the encoding needs to be
specified. My proposal involves explicitly specifying the encoding.
This is the key point about the Python 3 text model: it is not that
encoding isn't automatic (it is, e.g. when you print() or call
file.write on a text file); the point is that there must never be
ambiguity about the encoding that is used when encode/decode occurs.

> So why have a different, the-user-needs-to-think-about-encodings numpy
> dtype? We already have 'U' for full-on unicode support for text. There
> is a good argument for a more compact internal representation for text
> compatible with a one-byte-per-char encoding, thus the suggestion for
> such a dtype. But I don't see the need for quite this. Maybe I'm not
> being a creative enough thinker.

Because users want to store text in a numpy array and use less than 4
bytes per character. You expressed a desire for this. The only
difference between this and your latin-1 suggestion is that this one
has an explicit encoding that is visible to the user and that you can
choose that encoding to be anything that your Python installation
supports.

> Also, we may want numpy to interact at a low level with other libs that
> might have binary encoded text (HDF, etc) -- in which case we need a
> bytes dtype that can store that data, and perhaps encoding and decoding
> ufuncs.

Perhaps there is a need for a bytes dtype as well. But note that you
can use textarray with encoding='ascii' to satisfy many of these use
cases. So h5py and pytables could expose an interface that stores text
as bytes but has a clearly labelled (and enforced) encoding.
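As a rough sketch of what such an interface could look like (the
`EncodedBytes` wrapper here is purely illustrative, not an actual h5py or
pytables API): the raw 'S' buffer stays available while the declared
encoding is enforced on the text-facing side.

```python
import numpy as np

# Hypothetical sketch only -- not a real h5py/pytables API. The raw
# fixed-width bytes remain accessible, but text access always goes
# through the one clearly labelled encoding.
class EncodedBytes:
    def __init__(self, s_array, encoding='ascii'):
        self.raw = s_array          # underlying fixed-width 'S' array
        self.encoding = encoding    # the labelled, enforced encoding

    def __getitem__(self, i):
        return self.raw[i].decode(self.encoding)

    def __setitem__(self, i, text):
        # raises UnicodeEncodeError for text outside the encoding
        self.raw[i] = text.encode(self.encoding)

data = np.array([b'spam', b'eggs'], dtype='S4')
view = EncodedBytes(data, 'ascii')
view[1] = 'ham'
print(view[1])    # text out
print(data[1])    # bytes underneath
```

The point of the sketch is only that the encoding travels with the data
instead of being implicit.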

> If we want a more efficient and compact unicode implementation then the
> py3 one is a good place to start -- it's pretty slick! Though maybe
> harder to do in numpy as text in numpy probably wouldn't be immutable.

It's not a good fit for numpy because numpy arrays expose their memory
buffer. More on this below, but if there were to be something as
drastic as the FSR (CPython's flexible string representation) then it
would be better to think about how to make an ndarray type that is
completely different: one with an opaque memory buffer that can handle
arbitrary-length text strings.

>> To make a slightly more concrete proposal, I've implemented a pure
>> Python ndarray subclass that I believe can consistently handle
>> text/bytes in Python 3.
>
> this scares me right there -- is it text or bytes??? We really don't
> want something that is both.

I believe that there is a conceptual misunderstanding about what a
numpy array is here.

A numpy array is a clever view onto a memory buffer. A numpy array
always has two interfaces, one that describes a memory buffer and one
that delivers Python objects representing the abstract quantities
described by each portion of the memory buffer. The dtype specifies
three things:
1) How many bytes of the buffer are used.
2) What kind of abstract object this part of the buffer represents.
3) The mapping from the bytes in this segment of the buffer to the
abstract object.

As an example:

>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='u4')
>>> a
array([1, 2, 3], dtype=uint32)
>>> a.tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

So what is this array? Is it bytes or is it integers? It is both. The
array is a view onto a memory buffer and the dtype is the encoding
that describes the meaning of the bytes in different segments. In this
case the dtype is 'u4'. This tells us that we need 4 bytes per
segment, that each segment represents an integer and that the mapping
from byte segments to integers is the unsigned little-endian mapping.
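To make the dual interface concrete: numpy's view() reinterprets the
same buffer under a different dtype without copying anything.

```python
import numpy as np

# One 12-byte buffer, two interpretations: three little-endian uint32
# values, or twelve raw unsigned bytes. view() changes the dtype (the
# "encoding" of the buffer), not the memory itself.
a = np.array([1, 2, 3], dtype='<u4')
raw = a.view('u1')     # same buffer, now seen as 12 unsigned bytes
print(a.tobytes())     # the underlying bytes
print(raw[:4])         # first four bytes: the uint32 value 1
```

Both names describe exactly the same memory; only the mapping to
abstract objects differs.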

How can we do the same thing with text? We need a way to map text to
fixed-width bytes. Mapping text to bytes is done with text encodings.
So we need a dtype that incorporates a text encoding in order to
define the relationship between the bytes in the array's memory buffer
and the abstract entity that is a sequence of Unicode characters.
Using dtype='U' doesn't get around this:

>>> a = np.array(['qwe'], dtype='U')
>>> a
array(['qwe'],
      dtype='<U3')
>>> a[0]  # text
'qwe'
>>> a.tostring()  # bytes
b'q\x00\x00\x00w\x00\x00\x00e\x00\x00\x00'

In my proposal you'd get the same by using 'utf-32-le' as the encoding
for your text array.
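A quick check of that equivalence (on a little-endian machine, where
'U' storage lays out one 32-bit code unit per character):

```python
import numpy as np
import sys

# The 'U' dtype stores UTF-32 code units in native byte order, so on a
# little-endian machine the array's buffer is exactly the 'utf-32-le'
# encoding of the text.
a = np.array(['qwe'], dtype='U')
if sys.byteorder == 'little':
    assert a.tobytes() == 'qwe'.encode('utf-32-le')
```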

>> The idea is that the array has an encoding. It stores strings as
>> bytes. The

Re: [Numpy-discussion] Text array dtype for numpy

2014-01-24 Thread Chris Barker
Oscar,

Cool stuff, thanks!

I'm wondering though what the use-case really is. The Py3 text model
(actually the Py2 one, too) is quite clear that you want users to think
of, and work with, text as text -- and not care how things are encoded
in the underlying implementation. You only want the user to think about
encodings on I/O -- transferring stuff between systems where you can't
avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy
dtype? We already have 'U' for full-on unicode support for text. There
is a good argument for a more compact internal representation for text
compatible with a one-byte-per-char encoding, thus the suggestion for
such a dtype. But I don't see the need for quite this. Maybe I'm not
being a creative enough thinker.

Also, we may want numpy to interact at a low level with other libs that
might have binary encoded text (HDF, etc) -- in which case we need a bytes
dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation then the
py3 one is a good place to start -- it's pretty slick! Though maybe
harder to do in numpy as text in numpy probably wouldn't be immutable.

> To make a slightly more concrete proposal, I've implemented a pure
> Python ndarray subclass that I believe can consistently handle
> text/bytes in Python 3.


this scares me right there -- is it text or bytes??? We really don't want
something that is both.


> The idea is that the array has an encoding. It stores strings as
> bytes. The bytes are encoded/decoded on insertion/access. Methods
> accessing the binary content of the array will see the encoded bytes.
> Methods accessing the elements of the array will see unicode strings.
>
> I believe it would not be as hard to implement as the proposals for
> variable length string arrays.


except that with some encodings, the number of bytes required is a
function of what the content of the text is -- so it either has to be
variable length, or a fixed number of bytes, which is not a fixed
number of characters. That requires both careful truncation (a pain)
and gives surprising results for users ("why can't I fit 10 characters
in a length-10 text object? And why can I if they are different
characters?")
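That surprise is easy to demonstrate: with a variable-width encoding
like UTF-8, ten characters may or may not fit in ten bytes, depending
entirely on which characters they are.

```python
# With UTF-8 the byte length depends on content: 'a' encodes to 1 byte,
# 'é' to 2 bytes, so "10 characters" is not "10 bytes".
assert len(('a' * 10).encode('utf-8')) == 10   # fits a 10-byte field
assert len(('é' * 10).encode('utf-8')) == 20   # needs 20 bytes
```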


> The one caveat is that it will strip
> null characters from the end of any string.


which is fatal, but you do want a new dtype after all, which presumably
wouldn't do that.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959   voice
7600 Sand Point Way NE  (206) 526-6329   fax
Seattle, WA  98115      (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Text array dtype for numpy

2014-01-24 Thread josef.pktd
On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker chris.bar...@noaa.gov wrote:
> Oscar,
>
> Cool stuff, thanks!
>
> I'm wondering though what the use-case really is. The Py3 text model
> (actually the Py2 one, too) is quite clear that you want users to
> think of, and work with, text as text -- and not care how things are
> encoded in the underlying implementation. You only want the user to
> think about encodings on I/O -- transferring stuff between systems
> where you can't avoid it. And you might choose different encodings
> based on different needs.
>
> So why have a different, the-user-needs-to-think-about-encodings numpy
> dtype? We already have 'U' for full-on unicode support for text. There
> is a good argument for a more compact internal representation for text
> compatible with a one-byte-per-char encoding, thus the suggestion for
> such a dtype. But I don't see the need for quite this. Maybe I'm not
> being a creative enough thinker.

In my opinion something like Oscar's class would be very useful (with
some adjustments, especially making it easy to create an 'S' view or
put an encoding view on top of an 'S' array).

(Disclaimer: My only experience is in converting some examples in
statsmodels to bytes in py 3 and to play with some examples.)

My guess is that 'S'/bytes is very convenient for library code,
because it doesn't care about encodings (assuming we have enough
control that all bytes are in the same encoding), and we don't have
any overhead to convert to strings when comparing or working with
byte strings.
'S' is also very flexible because it doesn't tie us down to a minimum
size for the encoding nor any specific encoding.

The problem with 'S'/bytes is in input/output and interactive work, as
in the examples of Tom Aldcroft. The textarray dtype would allow us to
view any 'S' array so that we can have text/string interaction with
Python and get the correct encoding on input and output.

Whether you live in an ascii, latin1, cp1252, iso8859_5 or in any
other world, you could get your favorite minimal memory
S/bytes/strings.
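For the viewing part, numpy's existing elementwise codecs already go
part of the way (np.char.encode/np.char.decode are real functions; the
textarray proposal would make the encoding stick to the array rather
than being passed on every call):

```python
import numpy as np

# Round-trip between a compact 'S' array holding latin-1 bytes and a
# unicode 'U' array, using numpy's elementwise codec functions.
s = np.array([b'caf\xe9', b'ole'], dtype='S4')   # latin-1 encoded bytes
u = np.char.decode(s, 'latin-1')                 # -> 'U' dtype array
print(u[0])                                      # decoded text
back = np.char.encode(u, 'latin-1')              # -> 'S' dtype again
print(back[0])                                   # the original bytes
```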

I think this is useful as a complement to the current 'S' type, and to
make that more useful on python 3, independent of what other small
memory unicode dtype with predefined encoding numpy could get.


> Also, we may want numpy to interact at a low level with other libs that
> might have binary encoded text (HDF, etc) -- in which case we need a
> bytes dtype that can store that data, and perhaps encoding and decoding
> ufuncs.
>
> If we want a more efficient and compact unicode implementation then the
> py3 one is a good place to start -- it's pretty slick! Though maybe
> harder to do in numpy as text in numpy probably wouldn't be immutable.
>
>> To make a slightly more concrete proposal, I've implemented a pure
>> Python ndarray subclass that I believe can consistently handle
>> text/bytes in Python 3.
>
> this scares me right there -- is it text or bytes??? We really don't
> want something that is both.

Most users won't care about the internal representation of anything.
But when we want or find it useful we can view the memory with any
compatible dtype. That is, with numpy we always also have the raw
bytes. And there are lots of ways to shoot yourself in the foot.

Why would you want to do that?:
>>> a = np.arange(5)
>>> b = a.view('S4')
>>> b[1] = 'h'
>>> a
array([  0, 104,   2,   3,   4])

>>> a[1] = 'h'
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    a[1] = 'h'
ValueError: invalid literal for int() with base 10: 'h'



>> The idea is that the array has an encoding. It stores strings as
>> bytes. The bytes are encoded/decoded on insertion/access. Methods
>> accessing the binary content of the array will see the encoded bytes.
>> Methods accessing the elements of the array will see unicode strings.
>>
>> I believe it would not be as hard to implement as the proposals for
>> variable length string arrays.
>
> except that with some encodings, the number of bytes required is a
> function of what the content of the text is -- so it either has to be
> variable length, or a fixed number of bytes, which is not a fixed
> number of characters. That requires both careful truncation (a pain)
> and gives surprising results for users ("why can't I fit 10 characters
> in a length-10 text object? And why can I if they are different
> characters?")

Not really different from other places where you have to pay attention
to the underlying dtype; it's a question of providing the underlying
information (like itemsize).

(1 - 1e-20: I had code like that when I wasn't thinking properly or
wasn't paying enough attention to what I was typing.)



>> The one caveat is that it will strip
>> null characters from the end of any string.
>
> which is fatal, but you do want a new dtype after all, which presumably
> wouldn't do that.

The only place so far that I found where this really hurts is in the
decode examples (with utf-32-le for example). That's why I think numpy
needs to have decode/encode functions, so it can access the bytes
before they are null-truncated, besides being
[Numpy-discussion] Text array dtype for numpy

2014-01-23 Thread Oscar Benjamin
There have been a few threads discussing the problems of how to do
text with numpy arrays in Python 3.

To make a slightly more concrete proposal, I've implemented a pure
Python ndarray subclass that I believe can consistently handle
text/bytes in Python 3. It is intended to be an illustration since I
think that the real solution is a new dtype rather than an array
subclass (so that it can be used in e.g. record arrays).

The idea is that the array has an encoding. It stores strings as
bytes. The bytes are encoded/decoded on insertion/access. Methods
accessing the binary content of the array will see the encoded bytes.
Methods accessing the elements of the array will see unicode strings.

I believe it would not be as hard to implement as the proposals for
variable length string arrays. The one caveat is that it will strip
null characters from the end of any string. I'm not 100% sure that the
byte-stripping function will always work, but it does for all the
encodings I know and it seems to work with all the encodings that
Python has.
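That caveat is inherited from the 'S' dtype itself, which already drops
trailing null bytes on element access even though they remain in the
buffer (standard numpy behaviour):

```python
import numpy as np

# Fixed-width 'S' storage cannot tell padding nulls apart from genuine
# trailing '\0' characters, so element access strips them.
a = np.array([b'ab\x00'], dtype='S3')
print(a[0])          # the genuine trailing null is gone
print(a.tobytes())   # but it is still there in the buffer
```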

The code is inline below and attached (in case there are encoding
problems with this message!):

Oscar

#!/usr/bin/env python3

from numpy import ndarray, array

class textarray(ndarray):
    '''ndarray for holding encoded text.

    This is for demonstration purposes only. The real proposal
    is to specify the encoding as a dtype rather than a subclass.

    Only works as a 1-d array.

    >>> a = textarray(['qwert', 'zxcvb'], encoding='ascii')
    >>> a
    textarray(['qwert', 'zxcvb'],
      dtype='|S5:ascii')
    >>> a[0]
    'qwert'
    >>> a.tostring()
    b'qwertzxcvb'

    >>> a[0] = 'qwe'  # shorter string
    >>> a[0]
    'qwe'
    >>> a.tostring()
    b'qwe\\x00\\x00zxcvb'

    >>> a[0] = 'qwertyuiop'  # longer string
    Traceback (most recent call last):
    ...
    ValueError: Encoded bytes don't fit

    >>> b = textarray(['Õscar', 'qwe'], encoding='utf-8')
    >>> b
    textarray(['Õscar', 'qwe'],
      dtype='|S6:utf-8')
    >>> b[0]
    'Õscar'
    >>> b[0].encode('utf-8')
    b'\\xc3\\x95scar'
    >>> b.tostring()
    b'\\xc3\\x95scarqwe\\x00\\x00\\x00'

    >>> c = textarray(['qwe'], encoding='utf-32-le')
    >>> c
    textarray(['qwe'],
      dtype='|S12:utf-32-le')

    '''
    def __new__(cls, strings, encoding='utf-8'):
        bytestrings = [s.encode(encoding) for s in strings]
        a = array(bytestrings, dtype='S').view(textarray)
        a.encoding = encoding
        return a

    def __repr__(self):
        slist = ', '.join(repr(self[n]) for n in range(len(self)))
        return "textarray([%s], \n  dtype='|S%d:%s')"\
                % (slist, self.itemsize, self.encoding)

    def __getitem__(self, index):
        bstring = ndarray.__getitem__(self, index)
        return self._decode(bstring)

    def __setitem__(self, index, string):
        bstring = string.encode(self.encoding)
        if len(bstring) > self.itemsize:
            raise ValueError("Encoded bytes don't fit")
        ndarray.__setitem__(self, index, bstring)

    def _decode(self, b):
        b = b + b'\0' * (4 - len(b) % 4)
        s = b.decode(self.encoding)
        for n, c in enumerate(reversed(s)):
            if c != '\0':
                return s[:len(s)-n]
        return s

if __name__ == "__main__":
    import doctest
    doctest.testmod()