Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
 On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris charlesr.har...@gmail.com 
 wrote:
  On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith n...@pobox.com wrote:
  On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris 
  charlesr.har...@gmail.com wrote:
  
   I didn't say we should change the S type, but that we should have
   something, say 's', that appeared to python as a string. I think if we
   want transparent string interoperability with python together with a
   compressed representation, and I think we need both, we are going to
   have to deal with the difficulties of utf-8. That means raising errors
   if the string doesn't fit in the allotted size, etc. Mind, this is a
   workaround for the mass of ascii data that is already out there, not a
   substitute for 'U'.
 
  If we're going to be taking that much trouble, I'd suggest going ahead
  and adding a variable-length string type (where the array itself
  contains a pointer to a lookaside buffer, maybe with an optimization
  for stashing short strings directly). The fixed-length requirement is
  pretty onerous for lots of applications (e.g., pandas always uses
  dtype=O for strings -- and that might be a good workaround for some
  people in this thread for now). The use of a lookaside buffer would
  also make it practical to resize the buffer when the maximum code
  point changed, for that matter...
 
 The more I think about it, the more I think we may need to do that. Note
 that dynd has ragged arrays and I think they are implemented as pointers to
 buffers. The easy way for us to do that would be a specialization of object
 arrays to string types only as you suggest.

This wouldn't necessarily help for the gigarows of short text strings use case
(depending on what short means). Also even if it technically saves memory
you may have a greater overhead from fragmenting your array all over the heap.

On my 64 bit Linux system the size of a Python 3.3 str containing only ASCII
characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory
saving over dtype='U' only if the strings are 17 characters or more. To get a
50% saving over dtype='U' you'd need strings of at least 49 characters.
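To make those numbers concrete, here is a quick measurement sketch (the 49-byte header is a CPython implementation detail of 64-bit 3.3+ builds and may vary elsewhere):

```python
import sys
import numpy as np

# Per-string cost of an all-ASCII Python 3 str: a fixed header plus one
# byte per character (CPython 3.3+ compact representation, 64-bit build).
header = sys.getsizeof('')                    # ~49 bytes on 64-bit Linux
per_char = sys.getsizeof('abcde') - header    # 1 byte per ASCII char -> 5

# The 'U' dtype always stores 4 bytes (UCS-4) per character.
print(header, per_char, np.dtype('U5').itemsize)
```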

If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use
Pascal strings when short enough - null bytes cannot be used). This gives an
overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory
if the strings are more than 3 characters long and you get at least a 50%
saving for strings longer than 9 characters.
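Spelling out that arithmetic as a sketch (the 9-byte figure is the assumed 8-byte pointer plus 1 length byte):

```python
# Per-string storage cost of a hypothetical managed-buffer scheme versus
# the existing dtype='U' storage, for an N-character ASCII string.
def buffered(n):
    return 8 + 1 + n        # pointer + length byte + string data

def ucs4(n):
    return 4 * n            # dtype='U': 4 bytes per character

# First length at which the buffer scheme saves any memory, and the first
# at which it saves at least 50%.
saves_memory = min(n for n in range(1, 100) if buffered(n) < ucs4(n))
half_saving = min(n for n in range(1, 100) if buffered(n) <= ucs4(n) // 2)
print(saves_memory, half_saving)   # 4 9
```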

Using utf-8 in the buffers eliminates the need to go around checking maximum
code points etc., so I would guess that would be simpler to implement (CPython
has now had to triple all of its code paths that actually access the string
buffer).


Oscar
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Nathaniel Smith
On 21 Jan 2014 11:13, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
 If the Numpy array would manage the buffers itself then that per string
 memory overhead would be eliminated in exchange for an 8 byte pointer and
 at least 1 byte to represent the length of the string (assuming you can
 somehow use Pascal strings when short enough - null bytes cannot be used).
 This gives an overhead of 9 bytes per string (or 5 on 32 bit). In this
 case you save memory if the strings are more than 3 characters long and
 you get at least a 50% saving for strings longer than 9 characters.

There are various optimisations possible as well.

For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
pointer itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
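A toy sketch of that tagged-pointer trick in plain Python (the byte layout here, up to 7 characters plus a tag-and-length byte in 64 bits, is an illustrative assumption, not a proposed format):

```python
def pack_short(s: str) -> int:
    # Real pointers to aligned buffers have the low bit clear, so a value
    # with the low bit set can carry short ASCII data inline instead.
    data = s.encode('ascii')
    assert len(data) <= 7
    value = 1 | (len(data) << 1)          # low byte: tag bit + length
    for i, b in enumerate(data):
        value |= b << (8 * (i + 1))       # remaining 7 bytes: characters
    return value

def unpack_short(value: int) -> str:
    assert value & 1                      # tag bit set => inline string
    n = (value >> 1) & 0x7
    return bytes((value >> (8 * (i + 1))) & 0xFF
                 for i in range(n)).decode('ascii')

print(unpack_short(pack_short('CGA')))    # CGA
```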

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.

-n


Re: [Numpy-discussion] Creating an ndarray from an iterable over sequences

2014-01-21 Thread eat
Hi,


On Tue, Jan 21, 2014 at 8:34 AM, Dr. Leo fhaxbo...@googlemail.com wrote:

 Hi,

 I would like to write something like:

 In [25]: iterable=((i, i**2) for i in range(10))

 In [26]: a=np.fromiter(iterable, int32)
 ---------------------------------------------------------------------------
 ValueError                                Traceback (most recent call last)
 <ipython-input-26-5bcc2e94dbca> in <module>()
 ----> 1 a=np.fromiter(iterable, int32)

 ValueError: setting an array element with a sequence.


 Is there an efficient way to do this?

Perhaps you could just utilize structured arrays (
http://docs.scipy.org/doc/numpy/user/basics.rec.html), like:
iterable = ((i, i**2) for i in range(10))
a = np.fromiter(iterable, [('a', np.int32), ('b', np.int32)], 10)
a.view(np.int32).reshape(-1, 2)
Out[]:
array([[ 0,  0],
       [ 1,  1],
       [ 2,  4],
       [ 3,  9],
       [ 4, 16],
       [ 5, 25],
       [ 6, 36],
       [ 7, 49],
       [ 8, 64],
       [ 9, 81]])

My 2 cents,
-eat


 Creating two 1-dimensional arrays first is costly as one has to
 iterate twice over the data. So the only way I see is creating an
 empty [10,2] array and filling it row by row. This is memory-efficient
 but slow; a list comprehension is the other way around.

 If there is no solution, wouldn't it be possible to rewrite fromiter
 so as to accept sequences?

 Leo



Re: [Numpy-discussion] Creating an ndarray from an iterable over sequences

2014-01-21 Thread Oscar Benjamin
On Tue, Jan 21, 2014 at 07:34:19AM +0100, Dr. Leo wrote:
 Hi,
 
 I would like to write something like:
 
 In [25]: iterable=((i, i**2) for i in range(10))
 
 In [26]: a=np.fromiter(iterable, int32)
 ---------------------------------------------------------------------------
 ValueError                                Traceback (most recent call last)
 <ipython-input-26-5bcc2e94dbca> in <module>()
 ----> 1 a=np.fromiter(iterable, int32)

 ValueError: setting an array element with a sequence.
 
 
 Is there an efficient way to do this?
 
 Creating two 1-dimensional arrays first is costly as one has to
 iterate twice over the data. So the only way I see is creating an
 empty [10,2] array and filling it row by row. This is memory-efficient
 but slow; a list comprehension is the other way around.

You could use itertools:

>>> from itertools import chain
>>> g = ((i, i**2) for i in range(10))
>>> import numpy
>>> numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2)
array([[ 0,  0],
       [ 1,  1],
       [ 2,  4],
       [ 3,  9],
       [ 4, 16],
       [ 5, 25],
       [ 6, 36],
       [ 7, 49],
       [ 8, 64],
       [ 9, 81]], dtype=int32)
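If the number of elements is known up front, passing fromiter's count argument also lets NumPy allocate the result once instead of growing it as the iterator is consumed:

```python
import numpy as np
from itertools import chain

g = ((i, i**2) for i in range(10))
# 10 pairs flattened to 20 scalars; count=20 preallocates the output.
a = np.fromiter(chain.from_iterable(g), np.int32, count=20).reshape(-1, 2)
print(a.shape)   # (10, 2)
```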


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote:
 On 21 Jan 2014 11:13, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
  If the Numpy array would manage the buffers itself then that per string
  memory overhead would be eliminated in exchange for an 8 byte pointer
  and at least 1 byte to represent the length of the string (assuming you
  can somehow use Pascal strings when short enough - null bytes cannot be
  used). This gives an overhead of 9 bytes per string (or 5 on 32 bit). In
  this case you save memory if the strings are more than 3 characters long
  and you get at least a 50% saving for strings longer than 9 characters.
 
 There are various optimisations possible as well.
 
 For ASCII strings of up to length 8, one could also use tagged pointers to
 eliminate the lookaside buffer entirely. (Alignment rules mean that
 pointers to allocated buffers always have the low bits zero; so you can
 make a rule that if the low bit is set to one, then this means the
 pointer itself should be interpreted as containing the string data; use
 the spare bit in the other bytes to encode the length.)
 
 In some cases it may also make sense to let identical strings share
 buffers, though this adds some overhead for reference counting and
 interning.

Would this new dtype have an opaque memory representation? What would happen
in the following:

 >>> a = numpy.array(['CGA', 'GAT'], dtype='s')

 >>> memoryview(a)

 >>> with open('file', 'wb') as fout:
 ...     a.tofile(fout)

 >>> with open('file', 'rb') as fin:
 ...     a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a
text file? Or would you just need to use fromiter:

 >>> with open('file', encoding='utf-8') as fin:
 ...     a = numpy.fromiter(fin, dtype='s')

 >>> with open('file', 'w', encoding='utf-8') as fout:
 ...     fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines)

I think it would be less confusing to use dtype='u' rather than 's', in order
to signify that it is an optimised form of the 'U' dtype as far as access from
Python code is concerned. Calling it 's' only really makes sense if there is a
plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as
well?


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Aldcroft, Thomas
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris charlesr.har...@gmail.com
 wrote:

 [snip]

 The more I think about it, the more I think we may need to do that. Note
 that dynd has ragged arrays and I think they are implemented as pointers to
 buffers. The easy way for us to do that would be a specialization of object
 arrays to string types only as you suggest.

Is this approach intended to be in *addition to* the latin-1 s type
originally proposed by Chris, or *instead of* that?

- Tom





Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:

 [snip]

 Is this approach intended to be in *addition to* the latin-1 s type
 originally proposed by Chris, or *instead of* that?


Well, that's open for discussion. The problem is to have something that is
both compact (latin-1) and interoperates transparently with python 3
strings (utf-8). A latin-1 type would be easier to implement and would
probably be a better choice for something available in both python 2 and
python 3, but unless the python 3 developers come up with something clever
I don't  see how to make it behave transparently as a string in python 3.
OTOH, it's not clear to me how to make utf-8 operate transparently with
python 2 strings, especially as the unicode representation choices in
python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
is unlikely to be backported. The problem may be unsolvable in a completely
satisfactory way.
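To make the latin-1 trade-off concrete, a quick illustration in plain Python (not proposed NumPy code): one byte per character, but only code points up to U+00FF can be stored.

```python
s = u'\xd5scar'                      # 'Õscar' -- all code points < 256
encoded = s.encode('latin-1')        # one byte per character
assert encoded == b'\xd5scar' and len(encoded) == len(s)

try:
    u'\u0394'.encode('latin-1')      # GREEK CAPITAL LETTER DELTA, U+0394
except UnicodeEncodeError:
    print('code points above U+00FF cannot be represented')
```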

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Aldcroft, Thomas
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:

 [snip]

 Well, that's open for discussion. The problem is to have something that is
 both compact (latin-1) and interoperates transparently with python 3
 strings (utf-8). A latin-1 type would be easier to implement and would
 probably be a better choice for something available in both python 2 and
 python 3, but unless the python 3 developers come up with something clever
 I don't see how to make it behave transparently as a string in python 3.
 OTOH, it's not clear to me how to make utf-8 operate transparently with
 python 2 strings, especially as the unicode representation choices in
 python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8
 is unlikely to be backported. The problem may be unsolvable in a completely
 satisfactory way.


Since it's open for discussion, I'll put in my vote for implementing the
easier latin-1 version in the short term to facilitate Python 2 / 3
interoperability.  This would solve my use-case (giga-rows of short fixed
length strings), and presumably allow things like memory mapping of large
data files (like for FITS files in astropy.io.fits).

I don't have a clue how the current 'U' dtype works under the hood, but
from my user perspective it seems to work just fine in terms of interacting
with Python 3 

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Oscar Benjamin
On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote:

 Well, that's open for discussion. The problem is to have something that is
 both compact (latin-1) and interoperates transparently with python 3
 strings (utf-8). A latin-1 type would be easier to implement and would
 probably be a better choice for something available in both python 2 and
 python 3, but unless the python 3 developers come up with something clever
 I don't  see how to make it behave transparently as a string in python 3.
 OTOH, it's not clear to me how to make utf-8 operate transparently with
 python 2 strings, especially as the unicode representation choices in
 python 2 are ucs-2 or ucs-4

On Python 2, unicode strings can operate transparently with byte strings:

$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([u'\xd5scar'], dtype='U')
>>> a
array([u'\xd5scar'], 
      dtype='<U5')
>>> a[0]
u'\xd5scar'
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print(a[0])  # Encodes as UTF-8
Õscar
>>> 'My name is %s' % a[0]  # Decodes as ASCII
u'My name is \xd5scar'
>>> print('My name is %s' % a[0])  # Encodes as UTF-8
My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new
dtype always returns a unicode string under Py2 it should work (as well as the
Py2 text model ever does).

 and the python 3 work adding utf-16 and utf-8
 is unlikely to be backported. The problem may be unsolvable in a completely
 satisfactory way.

What do you mean by this? PEP 393 uses UCS-1/2/4 not utf-8/16/32 i.e. it
always uses a fixed-width encoding.

You can just use the CPython C-API to create the unicode strings. The simplest
way is probably use utf-8 internally and then call PyUnicode_DecodeUTF8 and
PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x
and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds
and post-3.3 FSR formats.
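A rough sketch of that decode-at-the-boundary idea using existing NumPy pieces (the helper names here are made up; a real dtype would do this in C, and the 'S' storage used for illustration shares the null-byte limitation mentioned earlier in the thread):

```python
import numpy as np

def to_utf8_array(strings, width):
    # Store utf-8 bytes in a fixed-width byte array.  'S' truncates at
    # embedded null bytes, so this is only a sketch of the storage side.
    return np.array([s.encode('utf-8') for s in strings], dtype='S%d' % width)

def getitem(arr, i):
    # Decode only when an element crosses the array boundary; O(num chars)
    # either way, as argued above.
    return arr[i].decode('utf-8')

a = to_utf8_array([u'\xd5scar', u'CGA'], width=8)
print(a.itemsize, getitem(a, 0))   # 8 Õscar
```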

Unlike Python's str, there isn't much need to be able to efficiently slice or
index within the string array element. Indexing into the array to get the
string requires creating a new object, so you may as well just decode from
utf-8 at that point (it's O(num chars) either way). There's no need to
constrain it to fixed-width encodings like the FSR does, in which case utf-8
is clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).


Oscar


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas 
aldcr...@head.cfa.harvard.edu wrote:

 [snip]

 Since it's open for discussion, I'll put in my vote for implementing the
 easier latin-1 version in the short term to facilitate Python 2 / 3
 interoperability.  This would solve my use-case (giga-rows of short fixed
 length strings), and presumably allow things like memory mapping of large
 data files (like for FITS files in astropy.io.fits).

 I don't have a clue how the current 'U' dtype works under the 

Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Sebastian Berg
On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:

 [snip]

Re: [Numpy-discussion] (no subject)

2014-01-21 Thread jennifer stone
 What are your interests and experience? If you use numpy, are there things
 you would like to fix, or enhancements you would like to see?

 Chuck


 I am an undergraduate student with CS as my major and an interest in Math
and Physics. This has led me to use NumPy and SciPy to work on innumerable
cases involving special functions and polynomials, like Legendre
polynomials, Bessel functions and so on, so the packages are familiar to me
from this point of view. I have a *few proposals* in mind, but I
don't have any idea whether they are acceptable within the scope of GSoC.
1. Many special functions and polynomials are included in neither NumPy nor
SciPy. These include ellipsoidal harmonic functions (Lamé functions) and
cylindrical harmonic functions; SciPy at present supports only spherical
harmonic functions.
Further, why can't we extend SciPy to incorporate *inverse Laplace
transforms*? At present Matlab has the amazing function *ilaplace* and
SymPy does have *inverse_laplace_transform*, but it would be better to
incorporate all in one package. I mean, SciPy does already have functions
to evaluate the Laplace transform.

After having written this, I feel that this post should have been sent to
SciPy, but as the majority of contributors are the same, I proceed.
Please suggest any other possible projects, as I would like to continue
with SciPy or NumPy, preferably NumPy as I have been fiddling with its
source code for a month now and so am pretty comfortable with it.

As for my experience, I have known C for the past 4 years and have been a
Python lover for the past year. I am pretty new to open source communities,
having started about a month and a half ago.

regards
Jennifer
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] (no subject)

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone jenny.stone...@gmail.com wrote:


 What are your interests and experience? If you use numpy, are there things
 you would like to fix, or enhancements you would like to see?

 Chuck


 snip


It does sound like scipy might be a better match; I don't think anyone
would complain if you cross-posted. Both scipy and numpy require GSoC
candidates to have a pull request accepted as part of the application
process. I'd suggest implementing a function not currently in scipy that
you think would be useful. That would also help in finding a mentor for the
summer. I'd also suggest getting familiar with cython.

Chuck


Re: [Numpy-discussion] (no subject)

2014-01-21 Thread Stéfan van der Walt
On Tue, 21 Jan 2014 21:56:17 +0530, jennifer stone wrote:
  1. Many special functions and polynomials are included in neither NumPy
 nor SciPy. These include ellipsoidal harmonic functions (Lamé functions)
 and cylindrical harmonic functions; SciPy at present supports only
 spherical harmonic functions.

SciPy's spherical harmonics are very inefficient if one is only interested in
computing one specific order.  I'd be so happy if someone would work on that!

Stéfan



Re: [Numpy-discussion] (no subject)

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 9:46 AM, Charles R Harris charlesr.har...@gmail.com
 wrote:




 On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone
 jenny.stone...@gmail.com wrote:

 snip

 It does sound like scipy might be a better match; I don't think anyone
 would complain if you cross-posted. Both scipy and numpy require GSoC
 candidates to have a pull request accepted as part of the application
 process. I'd suggest implementing a function not currently in scipy that
 you think would be useful. That would also help in finding a mentor for the
 summer. I'd also suggest getting familiar with cython.


I don't see you on github yet, are you there? If not, you should set up an
account to work in. See the developer guide
http://docs.scipy.org/doc/numpy/dev/ for some pointers.

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant its
own home on the Wiki?

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Nathaniel Smith
On 21 Jan 2014 17:28, David Goldsmith d.l.goldsm...@gmail.com wrote:


 Am I the only one who feels that this (very important--I'm being sincere,
not sarcastic) thread has matured and specialized enough to warrant it's
own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker
On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith d.l.goldsm...@gmail.com wrote:


 Am I the only one who feels that this (very important--I'm being sincere,
 not sarcastic) thread has matured and specialized enough to warrant it's
 own home on the Wiki?


Or  maybe a NEP?

https://github.com/numpy/numpy/tree/master/doc/neps

sorry -- really swamped this week, so I won't be writing it...

-Chris




-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
 Date: Tue, 21 Jan 2014 17:35:26 +
 From: Nathaniel Smith n...@pobox.com
 Subject: Re: [Numpy-discussion] A one-byte string dtype?
 To: Discussion of Numerical Python numpy-discussion@scipy.org

 On 21 Jan 2014 17:28, David Goldsmith d.l.goldsm...@gmail.com wrote:
 
 
  Am I the only one who feels that this (very important--I'm being sincere,
 not sarcastic) thread has matured and specialized enough to warrant it's
 own home on the Wiki?

 Sounds plausible, perhaps you could write up such a page?

 -n


I can certainly get one started (but I don't think I can faithfully
summarize all this thread's current content, so I apologize in advance for
leaving that undone).

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker
A lot of good discussion here -- too much to comment on individually, but it
seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text
for scientific computing. It is analogous to a lower-precision numeric type
-- i.e. it could not store arbitrary unicode strings, only the subset that is
compatible with the suggested encoding.
 Suggested encoding: latin-1
 Other options:
 - ascii only.
 - settable to any one-byte per char encoding supported by python
I like this IFF it's pretty easy, but it may
add significant complications (and overhead) for comparisons, etc

NOTE: This is NOT a way to conflate bytes and text, and not a way to go
back to the py2 mojibake hell -- the goal here is to very clearly have
this be text data, with a clearly defined encoding. Which is why we
can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
to conveniently and efficiently use numpy for text that is ANSI-compatible.

2) a utf-8 dtype:
NOTE: this CANNOT be used in place of (1) above. It is not a one-byte
per char encoding, so it would not fit snugly into the numpy data model.
   It would give compact memory use for mostly-ascii data, so that would be
nice.

3) a fully python-3-like (PEP 393) flexible unicode dtype.
  This would get us the advantages of the new py3 unicode model -- compact
and efficient when it can be, but also supporting all of unicode. Honestly,
this seems like more work than it's worth to me, at least given the current
numpy dtype model -- maybe a nice addition to dynd. You can, after
all, simply use an object array with py3 strings in it. Though perhaps
using the py3 unicode type, but having a dtype that specifically links to
that, rather than a generic python object, would be a good compromise.
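Proposal (1) can be approximated today with an 'S' array plus explicit encode/decode at the boundaries -- a sketch of what the new dtype would automate, using only existing numpy (np.char.decode); the sample data is illustrative:

```python
import numpy as np

# What a latin-1 dtype would automate: today you can store one-byte-per-char
# text compactly in an 'S' (bytes) array, but you must encode/decode by hand
# at every boundary -- the dtype itself carries no encoding label.
texts = ["spam", "café", "naïve"]
a = np.array([t.encode("latin-1") for t in texts], dtype="S5")
assert a.dtype == np.dtype("S5")      # 5 bytes per element, 1 byte per char

# Round-trip back to python 3 strings, element-wise:
decoded = np.char.decode(a, "latin-1")
assert decoded[1] == "café"
```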


Hmm -- I guess despite what I said, I've just written the starting point for
a NEP...

(or two, actually...)

-Chris


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Charles R Harris
On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker chris.bar...@noaa.gov wrote:

 A lot of good discussion here -- too much to comment on individually, but
 it seems we can boil it down to a couple of somewhat distinct proposals:

 snip


Should also mention the reasons for adding a new data type.

snip

Chuck


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
On Tue, Jan 21, 2014 at 10:00 AM, numpy-discussion-requ...@scipy.org wrote:

  From: David Goldsmith d.l.goldsm...@gmail.com
  Subject: Re: [Numpy-discussion] A one-byte string dtype?

 snip

  I can certainly get one started (but I don't think I can faithfully
  summarize all this thread's current content, so I apologize in advance for
  leaving that undone).

  DG

OK, I'm lost already: is there general agreement that this should jump
straight to one or more NEPs?  If not (or if there should be a Wiki page
for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site?  If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login?  I guess starting such a page is not as trivial as I had
assumed.

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Robert Kern
On Tue, Jan 21, 2014 at 6:34 PM, David Goldsmith d.l.goldsm...@gmail.com
wrote:

 snip
 OK, I'm lost already: is there general agreement that this should
jump straight to one or more NEP's?  If not (or if there should be a Wiki
page for it additionally), should such become part of the NumPy Wiki @
Sourceforge or the SciPy Wiki at the scipy.org site?  If the latter, is
one's SciPy Wiki login the same as one's mailing list subscriber
maintenance login?  I guess starting such a page is not as trivial as I had
assumed.

The wiki is frozen. Please do not add anything to it. It plays no role in
our current development workflow. Drafting a NEP or two and iterating on
them would be the next step.

--
Robert Kern


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-21 Thread Andrew Collette
Hi Chris,

Just stumbled on this discussion (I'm the lead author of h5py).

We would be overjoyed if there were a 1-byte text type available in
NumPy.  String handling is the source of major pain right now in the
HDF5 world.  All HDF5 strings are text (opaque types are used for
binary data), but we're forced into using the S type most of the
time because (1) the U type doesn't round-trip between HDF5 and
NumPy, as there's no fixed-width wide-character string type in HDF5,
and (2) U takes 4x the space, which is a problem for big scientific
datasets.

ASCII-only would be preferable, partly for selfish reasons (HDF5's
default is ASCII only), and partly to make it possible to copy them
into containers labelled UTF-8 without manually inspecting every
value.
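The "copyable into containers labelled UTF-8" property is simply the fact that ASCII is a strict subset of UTF-8, which a couple of lines of plain Python demonstrate:

```python
# ASCII is a strict subset of UTF-8: any pure-ASCII byte string can be
# placed verbatim in a container labelled UTF-8.
assert "plain text".encode("ascii").decode("utf-8") == "plain text"

# Latin-1 bytes with the high bit set are NOT valid UTF-8, so copying
# them unchecked into a UTF-8 container would corrupt the data.
raw = "café".encode("latin-1")        # b'caf\xe9'
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass
```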

 At the high-level interface, h5py exposes three kinds of strings. Each
 maps to a specific type within Python (but see str_py3 below):

 Fixed-length ASCII (NumPy S type)
 
 
 This is wrong, or misguided, or maybe only a little confusing -- 'S' is not
 an ASCII string (even though I wish it were...). But clearly the HDF folks
 think we need one!

Yes, this was intended to state that the HDF5 Fixed-width ASCII type
maps to NumPy S at conversion time, which is obviously a wretched
solution on Py3.

 dset = f.create_dataset("string_ds", (100,), dtype="S10")
 
 Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? From
 another post, I thought you'd need to use numpy.bytes_ (which is the same on
 py2)

It does produce an instance of 'numpy.bytes_', although I think the
h5py docs should be changed to use bytes_ explicitly.

Andrew


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-21 Thread Chris Barker
On Tue, Jan 21, 2014 at 3:22 PM, Andrew Collette
andrew.colle...@gmail.com wrote:

 Just stumbled on this discussion (I'm the lead author of h5py).

 We would be overjoyed if there were a 1-byte text type available in
 NumPy.


cool -- it looks like someone is going to get a draft PEP going -- so stay
tuned, and add you comments when there is something to add them too..

 String handling is the source of major pain right now in the
 HDF5 world.  All HDF5 strings are text (opaque types are used for
 binary data), but we're forced into using the S type most of the
 time because (1) the U type doesn't round-trip between HDF5 and
 NumPy, as there's no fixed-width wide-character string type in HDF5,


it looks from here:
http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html

that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
lot of calls to encode/decode -- which could be pretty slow, compared to
other ways to dump numpy arrays into HDF-5 -- that may be what you mean by
doesn't round trip.

This may be a good case for a numpy utf-8 dtype, I suppose (or an
arbitrary-encoding dtype, anyway).
But: How does hdf handle the fact that utf-8 is not a fixed length encoding?
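A sketch of the encode/decode round trip being discussed, using the existing np.char helpers (assuming, as above, that the file stores utf-8 bytes in a fixed-width buffer):

```python
import numpy as np

# Element-wise encode a 'U' array to bytes on the way out to the file,
# and decode back to unicode on the way in.
u = np.array(["alpha", "beta"], dtype="U5")
b = np.char.encode(u, "utf-8")        # bytes ('S') array for the file
assert b.dtype.kind == "S"

back = np.char.decode(b, "utf-8")     # unicode ('U') array on read
assert back[0] == "alpha"
```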

ASCII-only would be preferable, partly for selfish reasons (HDF5's
 default is ASCII only), and partly to make it possible to copy them
 into containers labelled UTF-8 without manually inspecting every
 value.


hmm -- ascii does have those advantages, but I'm not sure it's worth the
restriction on what can be encoded. But you're quite right, you could dump
ascii straight into something expecting utf-8, whereas you could not do
that with latin-1, for instance. But you can't go the other way -- does it
help much to avoid encoding in one direction?

But maybe we can have an any-one-byte-per-char encoding option, in which
case h5py could use ascii, but we wouldn't have to everywhere.

-Chris



Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread David Goldsmith
Date: Tue, 21 Jan 2014 19:20:12 +

 From: Robert Kern robert.k...@gmail.com
 Subject: Re: [Numpy-discussion] A one-byte string dtype?



 The wiki is frozen. Please do not add anything to it. It plays no role in
 our current development workflow. Drafting a NEP or two and iterating on
 them would be the next step.

 --
 Robert Kern


OK, well that's definitely beyond my level of expertise.

DG


Re: [Numpy-discussion] A one-byte string dtype?

2014-01-21 Thread Chris Barker - NOAA Federal
On Jan 21, 2014, at 4:58 PM, David Goldsmith d.l.goldsm...@gmail.com wrote:


 OK, well that's definitely beyond my level of expertise.

Well, it's in github--now's as good a time as any to learn github
collaboration...

-Fork the numpy source.

-Create a new file in:
numpy/doc/neps

Point folks to it here so they can comment, etc.

At some point, issue a pull request, and it can get merged into the
main source for final polishing...

-Chris







 DG


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-21 Thread Andrew Collette
Hi Chris,

 it looks from here:
 http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html

 that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
 lot of calls to encode/decode -- which could be pretty slow, compared to
 other ways to dump numpy arrays into HDF-5 -- that may be waht you mean by
 doesn't round trip.

HDF5 does have variable-length string support for UTF-8, so we map
that directly to the unicode type (str on Py3) exactly as you
describe, by encoding when we write to the file.  But there's no way
to round-trip with *fixed-width* strings.  You can go from e.g. a 10
byte ASCII string to U10, but going the other way fails if there are
characters which take more than 1 byte to represent.  We don't always
get to choose the destination type, when e.g. writing into an existing
dataset, so we can't always write vlen strings.

 This may be a good case for a numpy utf-8 dtype, I suppose (or a arbitrary
 encoding dtype, anyway).
 But: How does hdf handle the fact that utf-8 is not a fixed length encoding?

With fixed-width strings it doesn't, really.  If you use vlen strings
it's fine, but otherwise there's just a fixed-width buffer labelled
UTF-8.  Presumably you're supposed to be careful when writing not to
chop the string off in the middle of a multibyte character.  We could
truncate strings on their way to the file, but the risk of data
loss/corruption led us to simply not support it at all.
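The corruption risk Andrew describes is easy to demonstrate in plain Python: truncating a utf-8 buffer at a fixed byte width can split a multibyte character, leaving undecodable bytes:

```python
# 'naïve' takes 6 bytes in utf-8 because 'ï' encodes to two bytes.
data = "naïve".encode("utf-8")        # b'na\xc3\xafve'
assert len(data) == 6

# A fixed 3-byte field cuts 'ï' in half, so the buffer no longer decodes.
chopped = data[:3]                     # b'na\xc3' -- incomplete sequence
try:
    chopped.decode("utf-8")
except UnicodeDecodeError:
    pass
```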

 hmm -- ascii does have those advantages, but I'm not sure its worth the
 restriction on what can be encoded. But you're quite right, you could dump
 asciii straight into something expecting utf-8, whereas you could not do
 that with latin-1, for instance. But you can't go the other way -- does it
 help much to avoided encoding in one direction?

It would help for h5py specifically because most HDF5 strings are
labelled ASCII.  But it's a question for the community which is more
important: the high-bit characters in latin-1, or write-compatibility
with UTF-8.

Andrew


[Numpy-discussion] fromiter cannot create array of object - was: Creating an ndarray from an iterable, over sequences

2014-01-21 Thread Dr. Leo
Hi,

thanks. Both recarray and itertools.chain work just fine in the example
case.

However, the real purpose of this is to read strings from a large xml
file into a pandas DataFrame. But fromiter cannot create arrays of dtype
'object'. Fixed-length strings may be worth trying, but as the xml
schema does not guarantee a max. length, and pandas generally uses
'object' arrays for strings, I see no better way than creating the array
through a list comprehension and turning it into a DataFrame.

Maybe a variable length string/unicode type would help in the long term.
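For reference, the workaround described above (building the strings first, then making an object array) looks something like this -- np.fromiter refuses dtype=object, but np.empty plus assignment works; the sample strings are illustrative:

```python
import numpy as np

# Build the variable-length strings in a plain Python list first,
# then fill an object array -- no fixed-width truncation occurs.
strings = ["short", "a much longer string", "x"]
a = np.empty(len(strings), dtype=object)
a[:] = strings
assert a.dtype == np.dtype(object)
assert a[1] == "a much longer string"
```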

Leo



 I would like to write something like:

 In [25]: iterable=((i, i**2) for i in range(10))

 In [26]: a=np.fromiter(iterable, int32)
  ---------------------------------------------------------------------------
  ValueError                                Traceback (most recent call last)
  <ipython-input-26-5bcc2e94dbca> in <module>()
  ----> 1 a=np.fromiter(iterable, int32)

 ValueError: setting an array element with a sequence.


 Is there an efficient way to do this?

Perhaps you could just utilize structured arrays (
http://docs.scipy.org/doc/numpy/user/basics.rec.html), like:
iterable= ((i, i**2) for i in range(10))
a= np.fromiter(iterable, [('a', int32), ('b', int32)], 10)
a.view(int32).reshape(-1, 2)

You could use itertools:

 from itertools import chain
 g = ((i, i**2) for i in range(10))
 import numpy
 numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2)