Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Oscar Benjamin
On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
 On Jan 22, 2014, at 1:13 PM, Oscar Benjamin oscar.j.benja...@gmail.com 
 wrote:
 
 
  It's not safe to stop removing the null bytes. This is how numpy determines
  the length of the strings in a dtype='S' array. The strings are not
  fixed-width but rather have a maximum width.
 
 Exactly--but folks have told us on this list that they want (and are)
 using the 'S' style for arbitrary bytes, NOT for text. In which case
 you wouldn't want to remove null bytes. This is more evidence that 'S'
 was designed to handle c-style one-byte-per-char strings, and NOT
 arbitrary bytes, and thus not to map directly to the py2 string type
 (you can store null bytes in a py2 string)

You can store null bytes in a Py2 string but you normally wouldn't if it was
supposed to be text.

 
 Which brings me back to my original proposal: properly map the 'S'
 type to the py3 data model, and maybe add some kind of fixed width
 bytes style if there is a use case for that. I still have no idea what
 the use case might be.
 

There would definitely be a use case for a fixed-byte-width
bytes-representing-text dtype in record arrays to read from a binary file:

dt = np.dtype([
    ('name', '|b8:utf-8'),
    ('param1', '<i4'),
    ('param2', '<i4'),
    ...
])

with open('binaryfile', 'rb') as fin:
    a = np.fromfile(fin, dtype=dt)

You could also use this for ASCII if desired. I don't think it really matters
that utf-8 uses variable width as long as a too long byte string throws an
error (and does not truncate).
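
That check is easy to sketch; `encode_fixed` below is a hypothetical helper, not
an existing numpy API:

```python
def encode_fixed(s, width, encoding='utf-8'):
    """Encode s into exactly `width` bytes, padding with nulls.

    Raises an error for a too-long string instead of silently
    truncating (possibly mid-character for utf-8).
    """
    b = s.encode(encoding)
    if len(b) > width:
        raise ValueError('%r needs %d bytes; field is %d' % (s, len(b), width))
    return b.ljust(width, b'\0')

# 'Õ' encodes to two utf-8 bytes, so 'Õscar' occupies 6 of the 8 bytes:
encode_fixed('Õscar', 8)   # b'\xc3\x95scar\x00\x00'
```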

For non 8-bit encodings there would have to be some way to handle endianness
without a BOM, but otherwise I think that it's always possible to pad with zero
*bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip
null *characters* after decoding. i.e.:

$ cat tmp.py 
import encodings

def test_encoding(s1, enc):
    b = s1.encode(enc).ljust(32, b'\0')
    s2 = b.decode(enc)
    index = s2.find('\0')
    if index != -1:
        s2 = s2[:index]
    assert s1 == s2, enc

encodings_set = set(encodings.aliases.aliases.values())

for N, enc in enumerate(encodings_set):
    try:
        test_encoding('qwe', enc)
    except LookupError:
        pass

print('Tested %d encodings without error' % N)
$ python3 tmp.py 
Tested 88 encodings without error

  If the trailing nulls are not removed then you would get:
 
  >>> a[0]
  b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
  >>> len(a[0])
  9
 
  And I'm sure that someone would get upset about that.
 
 Only if they are using it for text, which you should not do with py3.

But people definitely are using it for text on Python 3. It should be
deprecated in favour of something new but breaking it is just gratuitous.
Numpy doesn't have the option to make a clean break with Python 3 precisely
because it needs to straddle 2.x and 3.x while numpy-based applications are
ported to 3.x.

  Some more oddities:
 
  >>> a[0] = 1
  >>> a
  array([b'1', b'string', b'of', b'different', b'length', b'words'],
        dtype='|S9')
  >>> a[0] = None
  >>> a
  array([b'None', b'string', b'of', b'different', b'length', b'words'],
        dtype='|S9')
 
 More evidence that this is a text type.

And the big one:

$ python3
Python 3.2.3 (default, Sep 25 2013, 18:22:43) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>>> a
array([b'asd', b'zxc'], 
      dtype='|S3')
>>> a[0] = 'qwer' # Unicode string again
>>> a
array([b'qwe', b'zxc'], 
      dtype='|S3')
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: 
ordinal not in range(128)

The analogous behaviour was very deliberately removed from Python 3:

>>> a[0] == 'qwe'
False
>>> a[0] == b'qwe'
True
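
Concretely, with current numpy the round trip only works if you decode
explicitly at the boundary (a sketch of the workaround, not an endorsement):

```python
import numpy as np

a = np.array(['asd', 'zxc'], dtype='S')   # str input, silently ASCII-encoded

# In Python 3 a bytes object never compares equal to a str:
assert a[0] == b'asd'
assert a[0] != 'asd'

# Decoding at the boundary restores the text comparison:
assert a[0].decode('ascii') == 'asd'
```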


Oscar
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] IRR

2014-01-23 Thread Nathaniel Smith
Hey all,

We have a PR languishing that fixes np.irr to handle negative rate-of-returns:
  https://github.com/numpy/numpy/pull/4210
I don't even know what IRR stands for, and it seems rather confusing
from the discussion there. Anyone who knows something about the issues
is invited to speak up...
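
For reference: IRR is the internal rate of return, the discount rate r at which
the net present value sum(cf[i] / (1 + r)**i) of a cash-flow series is zero. A
hedged sketch of the computation via polynomial roots (roughly what np.irr does
internally; negative rates correspond to roots x = 1/(1+r) greater than 1):

```python
import numpy as np

def irr(cashflows):
    """Internal rate of return of a cash-flow series (sketch).

    With x = 1/(1+r), NPV = sum(cf[i] * x**i) is a polynomial in x;
    np.roots expects the highest-degree coefficient first.
    """
    roots = np.roots(list(cashflows)[::-1])
    real = roots[np.abs(roots.imag) < 1e-9].real
    rates = 1.0 / real[real > 0] - 1.0       # r = 1/x - 1, require r > -1
    return rates[np.argmin(np.abs(rates))]   # rate closest to zero

irr([-100, 39, 59, 55, 20])   # roughly 0.281
```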

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org




Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 And the big one:

 $ python3
 Python 3.2.3 (default, Sep 25 2013, 18:22:43)
 [GCC 4.6.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import numpy as np
 >>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
 >>> a
 array([b'asd', b'zxc'],
       dtype='|S3')
 >>> a[0] = 'qwer' # Unicode string again
 >>> a
 array([b'qwe', b'zxc'],
       dtype='|S3')
 >>> a[0] = 'Õscar'
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 
 0: ordinal not in range(128)

This looks mostly like casting rules to me, ASCII-based rather than an
arbitrary encoding.

>>> a = np.array(['asd', 'zxc'], dtype='S')
>>> b = a.astype('U')
>>> b[0] = 'Õscar'
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    a[0] = 'Õscar'
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> b.view('S4')
array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
      dtype='|S4')

>>> a.astype('U').astype('S')
array([b'asd', b'zxc'],
      dtype='|S3')

Josef


 The analogous behaviour was very deliberately removed from Python 3:

 >>> a[0] == 'qwe'
 False
 >>> a[0] == b'qwe'
 True


 Oscar


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.p...@gmail.com wrote:

 another curious example, encode utf-8 to latin-1 bytes

  >>> b
  array(['Õsc', 'zxc'],
        dtype='<U3')
  >>> b[0].encode('utf8')
  b'\xc3\x95sc'
  >>> b[0].encode('latin1')
  b'\xd5sc'
  >>> b.astype('S')
  Traceback (most recent call last):
    File "<pyshell#40>", line 1, in <module>
      b.astype('S')
  UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
  position 0: ordinal not in range(128)
  >>> c = b.view('S4').astype('S1').view('S3')
  >>> c
  array([b'\xd5sc', b'zxc'],
        dtype='|S3')
  >>> c[0].decode('latin1')
  'Õsc'

 Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses
 ascii:

 >>> np.array(['Õsc']).astype('S4')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 
 0: ordinal not in range(128)
 >>> np.array(['Õsc']).view('S4')
 array([b'\xd5', b's', b'c'],
       dtype='|S4')


No, a view doesn't change the memory, it just changes the
interpretation and there shouldn't be any conversion involved.
astype does type conversion, but it goes through ascii encoding which fails.

>>> b = np.array(['Õsc', 'zxc'], dtype='U3')
>>> b.tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S12')
array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
      dtype='|S12')

The conversion happens somewhere in the array creation, but I have no
idea about the memory encoding for UCS-2 and the low-level layouts.

Josef


 
 The original numpy py3 conversion used latin-1 as default
 (It's still used in statsmodels, and I haven't looked at the structure
 under the common py2-3 codebase)

 if sys.version_info[0] >= 3:
     import io
     bytes = bytes
     unicode = str
     asunicode = str

 These two functions are an abomination:

 def asbytes(s):
     if isinstance(s, bytes):
         return s
     return s.encode('latin1')

 def asstr(s):
     if isinstance(s, str):
         return s
     return s.decode('latin1')


 Oscar


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 11:58 AM,  josef.p...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
 oscar.j.benja...@gmail.com wrote:

 The conversion happens somewhere in the array creation, but I have no
 idea about the memory encoding for UCS-2 and the low-level layouts.

utf8 encoded bytes

>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>> a
array([b'\xc3\x95sc', b'zxc'],
      dtype='|S4')
>>> a.tostring()
b'\xc3\x95sczxc\x00'
>>> a.view('S8')
array([b'\xc3\x95sczxc'],
      dtype='|S8')

>>> a[0].decode('latin1')
'Ã\x95sc'
>>> a[0].decode('utf8')
'Õsc'

Josef




[Numpy-discussion] cannot decode 'S'

2014-01-23 Thread josef . pktd
truncating null bytes in 'S' breaks decoding that needs them

>>> a = np.array([si.encode('utf-16LE') for si in ['Õsc', 'zxc']], dtype='S')
>>> a
array([b'\xd5\x00s\x00c', b'z\x00x\x00c'],
      dtype='|S6')

>>> [ai.decode('utf-16LE') for ai in a]
Traceback (most recent call last):
  File "<pyshell#118>", line 1, in <module>
    [ai.decode('utf-16LE') for ai in a]
  File "<pyshell#118>", line 1, in <listcomp>
    [ai.decode('utf-16LE') for ai in a]
  File "C:\Programs\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position
4: truncated data

messy workaround (arrays in contrast to scalars are not truncated in `tostring`)

>>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
['Õsc', 'zxc']
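
The workaround generalizes to a small helper (hypothetical name) that decodes
each element from the raw, still-padded buffer instead of from the
null-truncated scalar; `tobytes` is the newer spelling of `tostring`:

```python
import numpy as np

def decode_elements(a, encoding):
    """Decode a 1-d dtype='S' array element-wise, keeping the trailing
    null bytes that codecs like utf-16 need, then stripping the padding
    null characters after decoding."""
    width = a.dtype.itemsize
    raw = a.tobytes()                        # full, padded buffer
    return [raw[i:i + width].decode(encoding).rstrip('\0')
            for i in range(0, len(raw), width)]

a = np.array([s.encode('utf-16LE') for s in ['Õsc', 'zxc']], dtype='S')
decode_elements(a, 'utf-16LE')   # ['Õsc', 'zxc']
```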

Found while playing with examples in the other thread.

Josef


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Oscar Benjamin
On 23 January 2014 17:42,  josef.p...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 12:13 PM,  josef.p...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 11:58 AM,  josef.p...@gmail.com wrote:

 No, a view doesn't change the memory, it just changes the
 interpretation and there shouldn't be any conversion involved.
 astype does type conversion, but it goes through ascii encoding which fails.

  >>> b = np.array(['Õsc', 'zxc'], dtype='U3')
  >>> b.tostring()
  b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
  >>> b.view('S12')
  array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
        dtype='|S12')

 The conversion happens somewhere in the array creation, but I have no
 idea about the memory encoding for uc2 and the low level layouts.

  >>> b = np.array(['Õsc', 'zxc'], dtype='U3')
  >>> b[0].tostring()
  b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
  >>> 'Õsc'.encode('utf-32LE')
  b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

 Is that the encoding for 'U' ?

On a little-endian system, yes. I realise what's happening now. 'U'
represents each unicode character as a 32-bit unsigned integer giving the
code point of the character. The first 256 code points are exactly the
256 characters representable with latin-1, in the same order.

So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
latin-1. As a 32-bit integer the code point is 0x000000d5, which in
little-endian format becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
when you reinterpret that as 'S4' it strips the trailing nulls to get
the byte string b'\xd5', which is the latin-1 encoding for the
character. The same will happen for any string of latin-1 characters.
However, if you do have a code point of 256 or greater then you'll get
a byte string of length 2 or more.

On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
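
That byte-level story can be verified directly with struct (a small check, not
new numpy behaviour):

```python
import struct
import sys
import numpy as np

# 'Õ' is code point 0xd5; the 'U' dtype stores one 32-bit integer per character.
assert ord('Õ') == 0xd5
assert struct.pack('<I', 0xd5) == b'\xd5\x00\x00\x00'   # little-endian layout
assert struct.pack('>I', 0xd5) == b'\x00\x00\x00\xd5'   # big-endian layout
assert 'Õ'.encode('utf-32-le') == b'\xd5\x00\x00\x00'

if sys.byteorder == 'little':
    # Reinterpreting as 'S4' strips the trailing nulls, leaving exactly
    # the latin-1 byte for the character:
    assert np.array(['Õ'], dtype='U1').view('S4')[0] == 'Õ'.encode('latin-1')
```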

 another side effect of null truncation: cannot decode truncated data

 >>> b.view('S4').tostring()
 b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
 >>> b.view('S4')[0]
 b'\xd5'
 >>> b.view('S4')[0].tostring()
 b'\xd5'
 >>> b.view('S4')[:1].tostring()
 b'\xd5\x00\x00\x00'

 >>> b.view('S4')[0].decode('utf-32LE')
 Traceback (most recent call last):
   File "<pyshell#101>", line 1, in <module>
     b.view('S4')[0].decode('utf-32LE')
   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
     return codecs.utf_32_le_decode(input, errors, True)
 UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
 0: truncated data

 >>> b.view('S4')[:1].tostring().decode('utf-32LE')
 'Õ'

 numpy arrays need a decode and encode method

I'm not sure that they do. Rather there needs to be a text dtype that
knows what encoding to use in order to have a binary interface as
exposed by .tostring() and friends, but produce unicode strings
when indexed from Python code. Having both a text and a binary
interface to the same data implies having an encoding.
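
A toy model of that idea, as a wrapper class rather than a real dtype (all
names hypothetical): bytes storage inside, unicode at the Python boundary,
with the encoding fixed at creation.

```python
import numpy as np

class EncodedText:
    """Fixed-width bytes storage with a text interface: encode on
    __setitem__, decode on __getitem__. Assumes an encoding whose
    padding nulls can safely be stripped (utf-8, latin-1, ascii);
    a real dtype would decode the full field from the buffer."""

    def __init__(self, strings, width, encoding='utf-8'):
        self.encoding = encoding
        self._data = np.zeros(len(strings), dtype='S%d' % width)
        for i, s in enumerate(strings):
            self[i] = s

    def __setitem__(self, i, s):
        b = s.encode(self.encoding)
        if len(b) > self._data.dtype.itemsize:
            raise ValueError('encoded string too long for field')
        self._data[i] = b                       # numpy null-pads the field

    def __getitem__(self, i):
        return self._data[i].decode(self.encoding)

    def tobytes(self):
        return self._data.tobytes()             # the binary interface

t = EncodedText(['Õsc', 'zxc'], width=8)
t[0]               # 'Õsc'
len(t.tobytes())   # 16
```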


Oscar


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Chris Barker
Thanks for poking into this all. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us
room to change it, and backward compatibility is less of an issue because it's
broken already -- do we need to preserve bug-for-bug compatibility? Maybe,
but I suspect in this case not -- the code that works fine on py3 with
the 'S' type is probably only lucky that it hasn't encountered the issues
yet.

And no matter how you slice it, code being ported to py3 needs to deal with
text handling issues.

But here is where we stand:

The 'S' dtype:

 - was designed for one-byte-per-char text data.
 - was mapped to the py2 string type.
 - used the classic C null-terminated approach.
 - can be used for arbitrary bytes (as the py2 string type can), but not
quite, as it truncates null bytes -- so it's really a bad idea to use it that
way.

Under py3:
  The 'S' type maps to the py3 bytes type, because that's the closest to
the py2 string type. But it also does some inconsistent things with
encoding, and does treat a lot of other things as text. And the py3 bytes
type does not have the same text handling as the py2 string type, so things
like:

s = 'a string'
np.array((s,), dtype='S')[0] == s

give you False, rather than True as on py2. This is because a py3 string is
translated to the 'S' type (presumably with the default encoding -- another
maybe-not-good idea), but returns a bytes object, which does not compare
true to a py3 string. You can work around this with various calls to
encode() and decode(), and/or by using b'a string', but that is ugly, kludgy,
and doesn't work well with the py3 text model.


The py2 => py3 transition separated bytes and strings: strings are unicode,
and bytes are not to be used for text (directly). While there is some
text-related functionality still in bytes, the core devs are quite clear
that that is for special cases only, and not for general text processing.

I don't think numpy should fight this, but rather embrace the py3 text
model. The most natural way to do that is to use the existing 'U' dtype for
text. Really the best solution for most cases. (Like the above case)

However, there is a use case for a more efficient way to deal with text.
There are a couple ways to go about that that have been brought up here:

1: have a more efficient unicode dtype: variable length,
multiple encoding options, etc
- This is a fine idea that would support better text handling in numpy,
and _maybe_ better interaction with external libraries (HDF, etc...)

2: Have a one-byte-per-char text dtype:
  - This would be much easier to implement and fit into the current numpy
model, and satisfy a lot of common use cases for scientific data sets.
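
Option (2) can already be approximated in user code: keep 'S' storage and make
the one-byte encoding explicit with np.char.decode/np.char.encode (these do
exist today; this is a workaround, not the proposed dtype):

```python
import numpy as np

raw = np.array([b'\xd5sc', b'zxc'], dtype='S3')   # latin-1 encoded bytes
text = np.char.decode(raw, 'latin-1')             # unicode array, dtype '<U3'
text[0]                                           # 'Õsc'

back = np.char.encode(text, 'latin-1')            # round-trips to 'S3'
```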

We could certainly do both, but I'd like to see (2) get done sooner rather
than later.

A related issue is whether numpy needs a dtype analogous to py3 bytes --
I'm still not sure of the use-case there, so can't comment -- would it need
to be fixed length (fitting into the numpy data model better) or variable
length, or ??? Some folks are (apparently) using the current 'S' type in
this way, but I think that's ripe for errors, due to the null bytes issue.
Though maybe there is a null-bytes-are-special binary format that isn't
text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple
of choices:

 (1) deprecate it, so that it stays around for backward compatibility,
but encourage people to either use 'U' for text, or one of the new dtypes
that are yet to be implemented (maybe 's' for a one-byte-per-char dtype),
and use either uint8 or the new bytes dtype that is yet to be implemented.

 (2) fix it -- in this case, I think we need to be clear what it is:
 -- A one-byte-per-char text type? If so, it should map to a py3 string,
and have a defined encoding (ascii or latin-1, probably), or even better a
settable encoding (but only for one-byte-per-char encodings -- I don't
think utf-8 is a good idea here, as a utf-8 encoded string is of unknown
length. There is some room for debate here: the 'S' type is fixed
length and truncates anyway, so maybe it's fine for it to truncate utf-8, as
long as it doesn't truncate in the middle of a character).

   -- a bytes type? In which case, we should clean out all the
automatic conversions to/from text that are in it now.

I vote for it being our one-byte text type -- it almost is already, and it
would make the easiest transition for folks from py2 to py3. But backward
compatibility is backward compatibility.
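(For concreteness, a sketch of the latin-1 property being appealed to above: every possible byte value decodes without error, one character per byte, and decode/encode round-trips losslessly. This is plain Python, not a numpy proposal.)

```python
# Sketch: why latin-1 is a safe choice for a one-byte-per-char text dtype.
raw = bytes(range(256))               # every possible byte value
text = raw.decode('latin-1')          # never raises, one char per byte
assert len(text) == 256
assert text.encode('latin-1') == raw  # lossless round trip
```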

 numpy arrays need a decode and encode method


I'm not sure that they do. Rather there needs to be a text dtype that
 knows what encoding to use in order to have a binary interface as
 exposed by .tostring() and friends and but produce unicode strings
 when indexed from Python code. Having both a text and a binary
 interface to the same data implies having an encoding.


I agree with Oscar here -- let's not conflate encoded and decoded data --

Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 1:49 PM, Chris Barker chris.bar...@noaa.gov wrote:


 >>> s = 'a string'
 >>> np.array((s,), dtype='S')[0] == s

 Gives you False on py3, rather than True as on py2. This is because a py3
 string is translated to the 'S' type (presumably with the default encoding --
 another thing that's maybe not a good idea), but returns a bytes object,
 which does not compare true to a py3 string. You can work around this with
 various calls to encode() and decode(), and/or using b'a string', but that
 is ugly, kludgy, and doesn't work well with the py3 text model.

I think this is just inconsistent casting rules in numpy,

numpy should either refuse to assign the wrong type, instead of using
the repr as in some of the earlier examples of Oscar

>>> s = np.inf
>>> np.array((s,), dtype=int)[0] == s
Traceback (most recent call last):
  File "<pyshell#126>", line 1, in <module>
    np.array((s,), dtype=int)[0] == s
OverflowError: cannot convert float infinity to integer

or use the **same** conversion/casting rules also during the
interaction with python as are used in assignments and array creation.
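(The quoted example is easy to reproduce on Python 3; a minimal sketch of what actually happens -- the str is silently encoded on the way in, but never decoded on the way out:)

```python
import numpy as np

s = 'a string'
a = np.array((s,), dtype='S')     # py3 str silently encoded to bytes
assert a[0] == b'a string'        # what is actually stored
assert not (a[0] == s)            # bytes never compare equal to py3 str
assert a[0].decode('ascii') == s  # an explicit decode is needed to compare
```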

Josef
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] cannot decode 'S'

2014-01-23 Thread Chris Barker
Josef,

Nice find -- another reason why 'S' can NOT be used as-is for arbitrary
bytes.

See the other thread for my proposals about that.


 messy workaround (arrays in contrast to scalars are not truncated in
 `tostring`)

  >>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
 ['Õsc', 'zxc']


I think the real work around is to not try to store arbitrary bytes --
i.e. encoded text, in the 'S' dtype.

But  is there a convenient way to do it with other existing numpy types?

I tried to do it with uint8, and it's really awkward
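(A hedged sketch of what the uint8 approach looks like in practice -- workable, but every access goes through tobytes()/decode by hand, and numpy sees only numbers:)

```python
import numpy as np

text = 'Õscar'
raw = np.frombuffer(text.encode('utf-8'), dtype=np.uint8)
# raw is just the byte values 195, 149, 115, 99, 97, 114 -- no string
# semantics, no per-element character access, utf-8 structure invisible
recovered = raw.tobytes().decode('utf-8')
assert recovered == text
```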

-CHB





-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Chris Barker
On Thu, Jan 23, 2014 at 11:18 AM, josef.p...@gmail.com wrote:


 I think this is just inconsistent casting rules in numpy,

 numpy should either refuse to assign the wrong type, instead of using
 the repr as in some of the earlier examples of Oscar

  >>> s = np.inf
  >>> np.array((s,), dtype=int)[0] == s
 Traceback (most recent call last):
   File "<pyshell#126>", line 1, in <module>
     np.array((s,), dtype=int)[0] == s
 OverflowError: cannot convert float infinity to integer

 or use the **same** conversion/casting rules also during the
 interaction with python as are used in assignments and array creation.


Exactly -- but what should those conversion/casting rules be? We can't
decide that unless we decide if 'S' is for text or for arbitrary bytes --
it can't be both. I say text, that's what it's mostly trying to do already.
But if it's bytes, fine, then some things still need cleaning up, and we
could really use a one-byte-text type. And if it's text, then we may need
a bytes dtype.

Key here is that we don't  have the option of not breaking anything,
because there is a lot already broken.

-Chris




Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
  numpy arrays need a decode and encode method


 I'm not sure that they do. Rather there needs to be a text dtype that
 knows what encoding to use in order to have a binary interface as
 exposed by .tostring() and friends and but produce unicode strings
 when indexed from Python code. Having both a text and a binary
 interface to the same data implies having an encoding.


 I agree with Oscar here -- let's not conflate encoded and decoded data --
 the py3 text model is a fine one, we should work with it as much as
 practical.

 UNLESS: if we do add a bytes dtype, then it would be a reasonable use case
 to use it to store encoded text (just like the py3 bytes types), in which
 case it would be good to have encode() and decode() methods or ufuncs --
 probably  ufuncs. But that should be for special purpose, at the I/O
 interface kind of stuff.


I think we need both things: changing the memory and changing the view.

The same way we can convert between int and float and complex (trunc,
astype, real, ...) we should be able to convert between bytes and any
string (text) dtypes, i.e. decode and encode.

I'm reading a file in binary and then want to convert it to unicode,
only I realize I have only ascii and want to convert to something less
memory hungry.
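(Part of this already exists as copying operations: np.char.encode and np.char.decode convert elementwise between 'S' and 'U' arrays. What they don't give you is a view -- each call allocates a new array. A sketch:)

```python
import numpy as np

b = np.array([b'ascii only', b'abc'], dtype='S')  # e.g. read from a file
u = np.char.decode(b, 'ascii')   # new 'U' (unicode) array -- a copy
assert u.dtype.kind == 'U'
s = np.char.encode(u, 'ascii')   # back to 'S' -- again a copy
assert s.dtype.kind == 'S'
assert (s == b).all()
```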

views don't care about what the content means; it just has to be
memory compatible. I can view anything as an 'S' or a 'uint' (I
think).
What we currently don't have is a string/text view on 'S' that would
interact with python as a string.
(that's a vote in favor of a minimal one char string dtype that would
work for a limited number of encodings.)
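(A sketch of the view behavior Josef describes -- the same memory is reinterpreted in place, so writes through the view show up in the original array:)

```python
import numpy as np

a = np.array([b'abc', b'xyz'], dtype='S3')
v = a.view(np.uint8).reshape(-1)   # same memory, seen as raw bytes
assert v.tolist() == [97, 98, 99, 120, 121, 122]
v[0] = 65                          # mutate one byte through the view...
assert a[0] == b'Abc'              # ...and the 'S' array sees the change
```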

Josef


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 2:45 PM, Chris Barker chris.bar...@noaa.gov wrote:
 On Thu, Jan 23, 2014 at 11:18 AM, josef.p...@gmail.com wrote:


 I think this is just inconsistent casting rules in numpy,

 numpy should either refuse to assign the wrong type, instead of using
 the repr as in some of the earlier examples of Oscar

  >>> s = np.inf
  >>> np.array((s,), dtype=int)[0] == s
 Traceback (most recent call last):
   File "<pyshell#126>", line 1, in <module>
     np.array((s,), dtype=int)[0] == s
 OverflowError: cannot convert float infinity to integer

 or use the **same** conversion/casting rules also during the
 interaction with python as are used in assignments and array creation.


 Exactly -- but what should those conversion/casting rules be? We can't
 decide that unless we decide if 'S' is for text or for arbitrary bytes -- it
 can't be both. I say text, that's what it's mostly trying to do already. But
 if it's bytes, fine, then some things still need cleaning up, and we could
 really use a one-byte-text type.  and if it's text, then we may need a bytes
 dtype.

(remember I'm just a balcony muppet)

As far as I understand all codecs have the same ascii part. So I would
cast on ascii and raise on anything else.

or follow whatever the convention of numpy is:

>>> s = -256
>>> np.array((s,), dtype=np.uint8)[0] == s
False
>>> s = -1
>>> np.array((s,), dtype=np.uint8)[0] == s
False
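(What Josef's first option -- cast on ascii, raise on anything else -- would mean in practice; a sketch using Python's strict ascii codec, not anything numpy currently does:)

```python
def to_ascii_bytes(s):
    # strict ascii: casts the 7-bit subset, raises on anything else
    return s.encode('ascii')

assert to_ascii_bytes('plain') == b'plain'
try:
    to_ascii_bytes('Õscar')
except UnicodeEncodeError:
    pass  # non-ascii input is refused rather than silently mangled
```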


Josef


 Key here is that we don't  have the option of not breaking anything, because
 there is a lot already broken.

 -Chris




Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread josef . pktd
On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin
oscar.j.benja...@gmail.com wrote:
 On 23 January 2014 17:42,  josef.p...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 12:13 PM,  josef.p...@gmail.com wrote:
 On Thu, Jan 23, 2014 at 11:58 AM,  josef.p...@gmail.com wrote:

 No, a view doesn't change the memory, it just changes the
 interpretation and there shouldn't be any conversion involved.
 astype does type conversion, but it goes through ascii encoding which 
 fails.

 >>> b = np.array(['Õsc', 'zxc'], dtype='U3')
 >>> b.tostring()
 b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
 >>> b.view('S12')
 array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
   dtype='|S12')

 The conversion happens somewhere in the array creation, but I have no
 idea about the memory encoding for uc2 and the low level layouts.

 >>> b = np.array(['Õsc', 'zxc'], dtype='U3')
 >>> b[0].tostring()
 b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
 >>> 'Õsc'.encode('utf-32LE')
 b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

 Is that the encoding for 'U' ?

 On a little-endian system, yes. I realise what's happening now. 'U'
 represents unicode characters as a 32-bit unsigned integer giving the
 code point of the character. The first 256 code points are exactly the
 256 characters representable with latin-1 in the same order.

 So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in
 latin-1. As a 32-bit integer the code point is 0x000000d5, but in
 little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So
 when you reinterpret that as 'S4' it strips the remaining nulls to get
 the byte string b'\xd5'. Which is the latin-1 encoding for the
 character. The same will happen for any string of latin-1 characters.
 However if you do have a code point of 256 or greater then you'll get
 byte strings of length 2 or more.

 On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.

A curious consequence of this, if we have only 1-character elements:

>>> a = np.array([si.encode('utf-16LE') for si in ['Õ', 'z']], dtype='S')
>>> a32 = np.array([si.encode('utf-32LE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\xd5', b'\xd5')
>>> a[0] == a32[0]
True

>>> a32 = np.array([si.encode('utf-32BE') for si in ['Õ', 'z']], dtype='S')
>>> a = np.array([si.encode('utf-16BE') for si in ['Õ', 'z']], dtype='S')
>>> a[0], a32[0]
(b'\x00\xd5', b'\x00\x00\x00\xd5')
>>> a[0] == a32[0]
False

Josef




 another side effect of null truncation: cannot decode truncated data

 >>> b.view('S4').tostring()
 b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
 >>> b.view('S4')[0]
 b'\xd5'
 >>> b.view('S4')[0].tostring()
 b'\xd5'
 >>> b.view('S4')[:1].tostring()
 b'\xd5\x00\x00\x00'

 >>> b.view('S4')[0].decode('utf-32LE')
 Traceback (most recent call last):
   File "<pyshell#101>", line 1, in <module>
     b.view('S4')[0].decode('utf-32LE')
   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
     return codecs.utf_32_le_decode(input, errors, True)
 UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position
 0: truncated data

 >>> b.view('S4')[:1].tostring().decode('utf-32LE')
 'Õ'

 numpy arrays need a decode and encode method

 I'm not sure that they do. Rather there needs to be a text dtype that
 knows what encoding to use in order to have a binary interface as
 exposed by .tostring() and friends and but produce unicode strings
 when indexed from Python code. Having both a text and a binary
 interface to the same data implies having an encoding.


 Oscar


[Numpy-discussion] Text array dtype for numpy

2014-01-23 Thread Oscar Benjamin
There have been a few threads discussing the problems of how to do
text with numpy arrays in Python 3.

To make a slightly more concrete proposal, I've implemented a pure
Python ndarray subclass that I believe can consistently handle
text/bytes in Python 3. It is intended to be an illustration since I
think that the real solution is a new dtype rather than an array
subclass (so that it can be used in e.g. record arrays).

The idea is that the array has an encoding. It stores strings as
bytes. The bytes are encoded/decoded on insertion/access. Methods
accessing the binary content of the array will see the encoded bytes.
Methods accessing the elements of the array will see unicode strings.

I believe it would not be as hard to implement as the proposals for
variable length string arrays. The one caveat is that it will strip
null characters from the end of any string. I'm not 100% sure that the
null-stripping decode will always work, but it does for all the
encodings I know, and it seems to work with all the encodings that
Python has.

The code is inline below and attached (in case there are encoding
problems with this message!):

Oscar

#!/usr/bin/env python3

from numpy import ndarray, array

class textarray(ndarray):
    '''ndarray for holding encoded text.

    This is for demonstration purposes only. The real proposal
    is to specify the encoding as a dtype rather than a subclass.

    Only works as a 1-d array.

    >>> a = textarray(['qwert', 'zxcvb'], encoding='ascii')
    >>> a
    textarray(['qwert', 'zxcvb'],
      dtype='|S5:ascii')
    >>> a[0]
    'qwert'
    >>> a.tostring()
    b'qwertzxcvb'

    >>> a[0] = 'qwe'  # shorter string
    >>> a[0]
    'qwe'
    >>> a.tostring()
    b'qwe\\x00\\x00zxcvb'

    >>> a[0] = 'qwertyuiop'  # longer string
    Traceback (most recent call last):
    ...
    ValueError: Encoded bytes don't fit

    >>> b = textarray(['Õscar', 'qwe'], encoding='utf-8')
    >>> b
    textarray(['Õscar', 'qwe'],
      dtype='|S6:utf-8')
    >>> b[0]
    'Õscar'
    >>> b[0].encode('utf-8')
    b'\\xc3\\x95scar'
    >>> b.tostring()
    b'\\xc3\\x95scarqwe\\x00\\x00\\x00'

    >>> c = textarray(['qwe'], encoding='utf-32-le')
    >>> c
    textarray(['qwe'],
      dtype='|S12:utf-32-le')

    '''
    def __new__(cls, strings, encoding='utf-8'):
        bytestrings = [s.encode(encoding) for s in strings]
        a = array(bytestrings, dtype='S').view(textarray)
        a.encoding = encoding
        return a

    def __repr__(self):
        slist = ', '.join(repr(self[n]) for n in range(len(self)))
        return "textarray([%s],\n  dtype='|S%d:%s')" \
               % (slist, self.itemsize, self.encoding)

    def __getitem__(self, index):
        bstring = ndarray.__getitem__(self, index)
        return self._decode(bstring)

    def __setitem__(self, index, string):
        bstring = string.encode(self.encoding)
        if len(bstring) > self.itemsize:
            raise ValueError("Encoded bytes don't fit")
        ndarray.__setitem__(self, index, bstring)

    def _decode(self, b):
        # pad with null *bytes* to a multiple of 4 so that 4-byte
        # encodings such as utf-32 can decode a truncated element
        b = b + b'\0' * (4 - len(b) % 4)
        s = b.decode(self.encoding)
        # then strip trailing null *characters* after decoding
        for n, c in enumerate(reversed(s)):
            if c != '\0':
                return s[:len(s)-n]
        return s

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Chris Barker
On Thu, Jan 23, 2014 at 12:10 PM, josef.p...@gmail.com wrote:

  Exactly -- but what should those conversion/casting rules be? We can't
  decide that unless we decide if 'S' is for text or for arbitrary bytes
 -- it
  can't be both. I say text, that's what it's mostly trying to do already.
 But
  if it's bytes, fine, then some things still need cleaning up, and we
 could
  really use a one-byte-text type.  and if it's text, then we may need a
 bytes
  dtype.

 (remember I'm just a balcony muppet)


me too ;-)



 As far as I understand all codecs have the same ascii part.


nope -- certainly not multi-byte codecs. And one of the key points of utf-8
is that the ascii part is compatible -- none of the other full-unicode
encodings are.

many of the one-byte-per-char ones do share the ascii part, but not all, or
not completely.

 So I would cast on ascii and raise on anything else.


still a fine option -- clearly defined and quite useful for scientific
text. However, I would prefer latin-1 -- that way you might get garbage
for the non-ascii parts, but it wouldn't raise an exception, and it
round-trips through encoding/decoding. And you would have a somewhat more
useful subset -- including the latin-language characters and symbols like
the degree symbol, etc.
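(The round-trip claim is easy to check: latin-1 maps each byte to the code point with the same number, so decoding can never fail and re-encoding restores the original bytes exactly, even when the bytes were really utf-8:)

```python
utf8_bytes = 'Õscar'.encode('utf-8')    # b'\xc3\x95scar'
garbled = utf8_bytes.decode('latin-1')  # wrong text (0xc3 -> 'Ã'), no error
assert garbled != 'Õscar'               # garbage, as conceded above...
assert garbled.encode('latin-1') == utf8_bytes  # ...but nothing is lost
```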


 or follow whatever the convention of numpy is:

  s = -256
  np.array((s,), dtype=np.uint8)[0] == s
 False
  s = -1
  np.array((s,), dtype=np.uint8)[0] == s
 False


I think text is distinct enough from numbers that we don't need to do
the same thing -- and this is a result of well-defined casting rules built
into the compiler (and hardware?) for the numeric types. I don't think we
have either the standard or compiler support for text conversions like that.

-CHB

PS: this is interesting, on py2:


In [176]: a = np.array((,), dtype='S')

In [177]: a
Out[177]:
array(['2'],
  dtype='|S1')

It converts it to a string, but only grabs the first character? (Is
it determining the size before converting to a string?)

and this:

In [182]: a = np.array(, dtype='S')

In [183]: a
Out[183]:
array('',
  dtype='|S24')

24 ? where did that come from?













 Josef

 
  Key here is that we don't  have the option of not breaking anything,
 because
  there is a lot already broken.
 
  -Chris
 
 






Re: [Numpy-discussion] (no subject)

2014-01-23 Thread jennifer stone
 Both scipy and numpy require GSOC
 candidates to have a pull request accepted as part of the application
 process. I'd suggest implementing a function not currently in scipy that
 you think would be useful. That would also help in finding a mentor for the
 summer. I'd also suggest getting familiar with cython.

 Chuck


Thanks a lot for the heads-up. I am yet to be familiarized with Cython,
which indeed plays a crucial role, especially in the 'special' module.


 I don't see you on github yet, are you there? If not, you should set up an
 account to work in. See the developer guide
 http://docs.scipy.org/doc/numpy/dev/ for some pointers.

 Chuck

I am present on github, but the profile at present is just a record of the
humble mistakes of a beginner to open-sourcing. The id is
https://github.com/jennystone.
I hope to build upon my profile.

Jennifer


Re: [Numpy-discussion] (no subject)

2014-01-23 Thread jennifer stone
Scipy doesn't have a function for the Laplace transform, it has only a
 Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage.
 An inverse Laplace transform would be very welcome I'd think - it has real
 world applications, and there's no good implementation in any open source
 library as far as I can tell. It's probably doable, but not the easiest
 topic for a GSoC I think. From what I can find, the paper "Numerical
 Transform Inversion Using Gaussian Quadrature" by den Iseger contains
 what's considered the current state of the art algorithm. Browsing that
 gives a reasonable idea of the difficulty of implementing `ilaplace`.


A brief scan through the paper "Numerical Transform Inversion Using
Gaussian Quadrature" by den Iseger does indicate the complexity of the
algorithm. But GSoC project or not, can't we work on it step by step? I
would love to see a contender for Matlab's ilaplace on the open-source front!


 You can have a look at https://github.com/scipy/scipy/pull/2908/files for
 ideas. Most of the things that need improving or we really think we should
 have in Scipy are listed there. Possible topics are not restricted to that
 list though - it's more important that you pick something you're interested
 in and have the required background and coding skills for.


Thanks a lot for the roadmap. Of the options provided, I found the
'Cython'ization of Cluster great. Would it be possible to do it as the
Summer project if I spend the month learning Cython?

Regards
Janani



 Cheers,
 Ralf





[Numpy-discussion] De Bruijn sequence

2014-01-23 Thread Vincent Davis
I happen to be working with De Bruijn sequences. Is there any interest in
this being part of numpy/scipy?

https://gist.github.com/vincentdavis/8588879

Vincent Davis


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Oscar Benjamin
On 23 January 2014 21:51, Chris Barker chris.bar...@noaa.gov wrote:

 However, I would prefer latin-1 -- that way  you  might get garbage for the
 non-ascii parts, but it wouldn't raise an exception and it round-trips
 through encoding/decoding. And you would have a somewhat more useful subset
 -- including the latin-language character and symbols like the degree
 symbol, etc.

Exceptions and error messages are a good thing! Garbage is not!!!  :)


Oscar


Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Chris Barker
On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin
oscar.j.benja...@gmail.comwrote:

 On 23 January 2014 21:51, Chris Barker chris.bar...@noaa.gov wrote:
 
  However, I would prefer latin-1 -- that way  you  might get garbage for
 the
  non-ascii parts, but it wouldn't raise an exception and it round-trips
  through encoding/decoding. And you would have a somewhat more useful
 subset
  -- including the latin-language character and symbols like the degree
  symbol, etc.

 Exceptions and error messages are a good thing! Garbage is not!!!  :)


in principle, I agree with you, but sometimes practicality beats purity.

in py2 there is a lot of implicit encoding/decoding going on, using the
system encoding. That is ascii on a lot of systems. The result is that
there is a lot of code out there that folks have ported to use unicode, but
missed a few corners. If that code is only tested with ascii, it all seems
to be working, but then out in the wild someone
puts another character in there and presto -- a crash.

Also, there are places where the inability to encode causes silent failures
-- for instance, if an Exception is raised with a unicode message, it will
get silently dropped when it comes time to display it on the terminal. I
spent quite a while banging my head against that one recently when I tried
to update some code to read unicode files. I would have been MUCH happier
with a bit of garbage in the message than having it dropped (or raise
an encoding error in the middle of the error...)

I think this is a bad thing.

The advantage of latin-1 is that while you might get something that
doesn't print right, it won't crash, and it won't contaminate the data, so
comparisons, etc., will still work -- kind of like using utf-8 in an old-style
c char array: you can still pass it around and compare it, even if the
bytes don't mean what you think they do.

-CHB




Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Chris Barker
On Thu, Jan 23, 2014 at 3:56 PM, josef.p...@gmail.com wrote:


 I'm not sure anymore, after all these threads I think bytes should be
 bytes and strings should be strings


exactly -- that's the py3 model, and I think we really should try to conform
to it; it's really the only way to have a robust solution.


 I like the idea of an `encoding_view` on some 'S' bytes, and once we
 have a view like that there is no reason to pretend 'S' bytes are
 text.


right, then they are bytes, not text. period.

I'm not sure if we should conflate encoded text and arbitrary bytes, but it
does make sense to build encoded text on a bytes object.

maybe I didn't pay attention because I didn't care, until we ran into
 the python 3 problems. maybe nobody else did either.


yup -- I think this didn't get a whole lot of review or testing

-Chris




Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array

2014-01-23 Thread Oscar Benjamin
On 24 January 2014 01:09, Chris Barker chris.bar...@noaa.gov wrote:
 On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin oscar.j.benja...@gmail.com
 wrote:

 On 23 January 2014 21:51, Chris Barker chris.bar...@noaa.gov wrote:
 
  However, I would prefer latin-1 -- that way  you  might get garbage for
  the
  non-ascii parts, but it wouldn't raise an exception and it round-trips
  through encoding/decoding. And you would have a somewhat more useful
  subset
  -- including the latin-language character and symbols like the degree
  symbol, etc.

 Exceptions and error messages are a good thing! Garbage is not!!!  :)

 in principle, I agree with you, but sometimes practicality beats purity.

 in py2 there is a lot of implicit encoding/decoding going on, using the
 system encoding. That is ascii on a lot of systems. The result is that there
 is a lot of code out there that folks have ported to use unicode, but missed
 a few corners. If that code is only tested with ascii, it all seems to be
 working, but then out in the wild someone puts another character in there and
 presto -- a crash.

Precisely. The Py3 text model uses TypeErrors to warn early against
this kind of thing. No longer do you have code that seems to work
until the wrong character goes in. You get the error straight away
when you try to mix bytes and text. You still have the option to
silence those errors: it just needs to be done explicitly:

>>> s = 'Õscar'
>>> s.encode('ascii', errors='replace')
b'?scar'

 Also, there are places where the inability to encode causes silent failures --
 for instance, if an Exception is raised with a unicode message, it will get
 silently dropped when it comes time to display it on the terminal. I spent
 quite a while banging my head against that one recently when I tried to
 update some code to read unicode files. I would have been MUCH happier with
 a bit of garbage in the message than having it dropped (or raise an encoding
 error in the middle of the error...)

Yeah, that's just a bug in CPython. I think it's fixed now but either
way you're right: for the particular case of displaying error messages
the interpreter should do whatever it takes to get some kind of error
message out even if it's a bit garbled. I disagree that this should be
the basis for ordinary data processing with numpy though.

 I think this is a bad thing.

 The advantage of latin-1 is that while you might get something that doesn't
 print right, it won't crash, and it won't contaminate the data, so
 comparisons, etc., will still work -- kind of like using utf-8 in an old-style
 c char array: you can still pass it around and compare it, even if the
 bytes don't mean what you think they do.

It round trips okay as long as you don't try to do anything else with
the string. So does the textarray class I proposed in a new thread: If
you just use fromfile and tofile it works fine for any input (except
for trailing nulls) but if you try to decode invalid bytes it will
throw errors. It wouldn't be hard to add configurable error-handling
there either.


Oscar
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion