Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
>> It's not safe to stop removing the null bytes. This is how numpy determines the length of the strings in a dtype='S' array. The strings are not fixed-width but rather have a maximum width.
>
> Exactly -- but folks have told us on this list that they want (and are) using the 'S' style for arbitrary bytes, NOT for text. In which case you wouldn't want to remove null bytes. This is more evidence that 'S' was designed to handle c-style one-byte-per-char strings, and NOT arbitrary bytes, and thus not to map directly to the py2 string type (you can store null bytes in a py2 string).

You can store null bytes in a Py2 string but you normally wouldn't if it was supposed to be text.

> Which brings me back to my original proposal: properly map the 'S' type to the py3 data model, and maybe add some kind of fixed-width bytes style if there is a use case for that. I still have no idea what the use case might be.

There would definitely be a use case for a fixed-byte-width bytes-representing-text dtype in record arrays to read from a binary file:

    dt = np.dtype([('name', '|b8:utf-8'),
                   ('param1', 'i4'),
                   ('param2', 'i4'),
                   ...])

    with open('binaryfile', 'rb') as fin:
        a = np.fromfile(fin, dtype=dt)

You could also use this for ASCII if desired. I don't think it really matters that utf-8 uses variable width, as long as a too-long byte string throws an error (and does not truncate). For non-8-bit encodings there would have to be some way to handle endianness without a BOM, but otherwise I think that it's always possible to pad with zero *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and strip null *characters* after decoding.
i.e.:

$ cat tmp.py
import encodings

def test_encoding(s1, enc):
    b = s1.encode(enc).ljust(32, b'\0')
    s2 = b.decode(enc)
    index = s2.find('\0')
    if index != -1:
        s2 = s2[:index]
    assert s1 == s2, enc

encodings_set = set(encodings.aliases.aliases.values())
for N, enc in enumerate(encodings_set):
    try:
        test_encoding('qwe', enc)
    except LookupError:
        pass
print('Tested %d encodings without error' % N)

$ python3 tmp.py
Tested 88 encodings without error

If the trailing nulls are not removed then you would get:

>>> a[0]
b'a\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(a[0])
9

And I'm sure that someone would get upset about that.

> Only if they are using it for text -- which you should not do with py3.

But people definitely are using it for text on Python 3. It should be deprecated in favour of something new, but breaking it is just gratuitous. Numpy doesn't have the option to make a clean break with Python 3, precisely because it needs to straddle 2.x and 3.x while numpy-based applications are ported to 3.x.

Some more oddities:

>>> a[0] = 1
>>> a
array([b'1', b'string', b'of', b'different', b'length', b'words'], dtype='|S9')
>>> a[0] = None
>>> a
array([b'None', b'string', b'of', b'different', b'length', b'words'], dtype='|S9')

More evidence that this is a text type. And the big one:

$ python3
Python 3.2.3 (default, Sep 25 2013, 18:22:43)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array(['asd', 'zxc'], dtype='S')  # Note unicode strings
>>> a
array([b'asd', b'zxc'], dtype='|S3')
>>> a[0] = 'qwer'  # Unicode string again
>>> a
array([b'qwe', b'zxc'], dtype='|S3')
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

The analogous behaviour was very deliberately removed from Python 3:

>>> a[0] == 'qwe'
False
>>> a[0] == b'qwe'
True

Oscar
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
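[Editor's aside: the C-style 'S' semantics discussed in this message -- write truncates to the fixed width and pads with null bytes, read strips trailing nulls -- can be modeled in a few lines of plain Python. The function names here are hypothetical, purely for illustration:]

```python
def s_store(value, width):
    # Mimic assignment into a dtype='S<width>' array: truncate the input
    # to the fixed width and pad the remainder with null bytes.
    return value[:width].ljust(width, b'\0')

def s_load(buf):
    # Mimic reading a dtype='S' element back: trailing null bytes are
    # stripped, which is how numpy recovers the "actual" string length.
    return buf.rstrip(b'\0')

print(s_store(b'qwer', 3))       # b'qwe' -- silently truncated
print(s_load(s_store(b'a', 9)))  # b'a' -- the 8 padding nulls are gone
print(s_load(b'a\x00b\x00'))     # b'a\x00b' -- interior nulls survive
```

This round trip is exactly what loses data when the stored bytes legitimately end in null bytes, as with UTF-16-encoded text.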
[Numpy-discussion] IRR
Hey all,

We have a PR languishing that fixes np.irr to handle negative rates of return: https://github.com/numpy/numpy/pull/4210

I don't even know what IRR stands for, and it seems rather confusing from the discussion there. Anyone who knows something about the issues is invited to speak up...

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> [...]
> >>> a[0] = 'Õscar'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)

looks mostly like casting rules to me, which look ASCII based instead of using an arbitrary encoding.
>>> a = np.array(['asd', 'zxc'], dtype='S')
>>> b = a.astype('U')
>>> b[0] = 'Õscar'
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    a[0] = 'Õscar'
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> b
array(['Õsc', 'zxc'], dtype='<U3')
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
>>> b.view('S4')
array([b'\xd5', b's', b'c', b'z', b'x', b'c'], dtype='|S4')
>>> a.astype('U').astype('S')
array([b'asd', b'zxc'], dtype='|S3')

Josef
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 10:41 AM, <josef.p...@gmail.com> wrote:
> [...]
> looks mostly like casting rules to me, which look ASCII based instead of using an arbitrary encoding.
another curious example, encode utf-8 to latin-1 bytes:

>>> b
array(['Õsc', 'zxc'], dtype='<U3')
>>> b[0].encode('utf8')
b'\xc3\x95sc'
>>> b[0].encode('latin1')
b'\xd5sc'
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#40>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.p...@gmail.com wrote:
>> another curious example, encode utf-8 to latin-1 bytes:
>> >>> b[0].encode('utf8')
>> b'\xc3\x95sc'
>> >>> b[0].encode('latin1')
>> b'\xd5sc'
>> >>> c = b.view('S4').astype('S1').view('S3')
>> >>> c
>> array([b'\xd5sc', b'zxc'], dtype='|S3')
>> >>> c[0].decode('latin1')
>> 'Õsc'
>
> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype() uses ascii:
>
> >>> np.array(['Õsc']).astype('S4')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position 0: ordinal not in range(128)
> >>> np.array(['Õsc']).view('S4')
> array([b'\xd5', b's', b'c'], dtype='|S4')

No, a view doesn't change the memory, it just changes the interpretation, and there shouldn't be any conversion involved. astype does type conversion, but it goes through ascii encoding, which fails.

>>> b = np.array(['Õsc', 'zxc'], dtype='U3')
>>> b.tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S12')
array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], dtype='|S12')

The conversion happens somewhere in the array creation, but I have no idea about the memory encoding for ucs2 and the low-level layouts.
Josef

> The original numpy py3 conversion used latin-1 as default (it's still used in statsmodels, and I haven't looked at the structure under the common py2-3 codebase):
>
>     if sys.version_info[0] >= 3:
>         import io
>         bytes = bytes
>         unicode = str
>         asunicode = str

These two functions are an abomination:

    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return s.encode('latin1')

    def asstr(s):
        if isinstance(s, str):
            return s
        return s.decode('latin1')

Oscar
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 11:58 AM, <josef.p...@gmail.com> wrote:
> [...]
> The conversion happens somewhere in the array creation, but I have no idea about the memory encoding for ucs2 and the low-level layouts.
utf8 encoded bytes:

>>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
>>> a
array([b'\xc3\x95sc', b'zxc'], dtype='|S4')
>>> a.tostring()
b'\xc3\x95sczxc\x00'
>>> a.view('S8')
array([b'\xc3\x95sczxc'], dtype='|S8')
>>> a[0].decode('latin1')
'Ã\x95sc'
>>> a[0].decode('utf8')
'Õsc'

Josef
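[Editor's aside: the mix-up above can be reproduced in plain Python, without numpy. Decoding UTF-8 bytes as latin-1 silently yields mojibake, while the matching codec round-trips:]

```python
s = 'Õsc'

# Encode to UTF-8: 'Õ' (U+00D5) becomes the two bytes 0xC3 0x95.
utf8_bytes = s.encode('utf8')
print(utf8_bytes)                   # b'\xc3\x95sc'

# Decoding those bytes as latin-1 maps every byte to one character,
# producing mojibake instead of raising an error.
print(utf8_bytes.decode('latin1'))  # 'Ã\x95sc'

# Only the matching codec recovers the original text.
print(utf8_bytes.decode('utf8'))    # 'Õsc'
```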
[Numpy-discussion] cannot decode 'S'
truncating null bytes in 'S' breaks decoding that needs them:

>>> a = np.array([si.encode('utf-16LE') for si in ['Õsc', 'zxc']], dtype='S')
>>> a
array([b'\xd5\x00s\x00c', b'z\x00x\x00c'], dtype='|S6')
>>> [ai.decode('utf-16LE') for ai in a]
Traceback (most recent call last):
  File "<pyshell#118>", line 1, in <module>
    [ai.decode('utf-16LE') for ai in a]
  File "<pyshell#118>", line 1, in <listcomp>
    [ai.decode('utf-16LE') for ai in a]
  File "C:\Programs\Python33\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data

messy workaround (arrays, in contrast to scalars, are not truncated in `tostring`):

>>> [a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
['Õsc', 'zxc']

Found while playing with examples in the other thread.

Josef
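[Editor's aside: the failure mode is reproducible in plain Python. UTF-16-LE encodes each character here as two bytes, so the final byte of 'c' is a null that carries meaning; stripping trailing nulls (as the 'S' dtype does) leaves an odd-length byte string that no longer decodes:]

```python
b = 'Õsc'.encode('utf-16-le')
print(b)                    # b'\xd5\x00s\x00c\x00' -- 3 chars, 6 bytes

# Emulate what storing in a dtype='S' array does: strip trailing nulls.
stripped = b.rstrip(b'\x00')
print(stripped)             # b'\xd5\x00s\x00c' -- 5 bytes, no longer valid UTF-16

try:
    stripped.decode('utf-16-le')
except UnicodeDecodeError as e:
    print(e)                # truncated data
```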
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On 23 January 2014 17:42, <josef.p...@gmail.com> wrote:
> [...]
> >>> b = np.array(['Õsc', 'zxc'], dtype='U3')
> >>> b[0].tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
> >>> 'Õsc'.encode('utf-32LE')
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>
> Is that the encoding for 'U'?

On a little-endian system, yes. I realise what's happening now. 'U' represents unicode characters as a 32-bit unsigned integer giving the code point of the character. The first 256 code points are exactly the 256 characters representable with latin-1, in the same order. So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in latin-1. As a 32-bit integer the code point is 0x000000d5, but in little-endian format that becomes the 4 bytes 0xd5, 0x00, 0x00, 0x00. So when you reinterpret that as 'S4' it strips the trailing nulls to get the byte string b'\xd5', which is the latin-1 encoding for the character. The same will happen for any string of latin-1 characters. However, if you do have a code point of 256 or greater then you'll get a byte string of length 2 or more. On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
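[Editor's aside: the code-point correspondence described above can be checked in plain Python, no numpy needed. 'utf-32-le' is used here because the explicit endianness means no BOM is emitted:]

```python
# 'Õ' is code point 0xd5; in UTF-32-LE that is the byte 0xd5 followed by
# three null bytes, and in latin-1 it is the single byte 0xd5.
ch = 'Õ'
print(hex(ord(ch)))            # 0xd5
print(ch.encode('utf-32-le'))  # b'\xd5\x00\x00\x00'
print(ch.encode('latin-1'))    # b'\xd5'

# Stripping the trailing nulls from the UTF-32-LE bytes yields exactly
# the latin-1 encoding -- which is why the 'S4' reinterpretation "works"
# for latin-1 text and for nothing beyond it.
assert ch.encode('utf-32-le').rstrip(b'\x00') == ch.encode('latin-1')
```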
> another side effect of null truncation: cannot decode truncated data
>
> >>> b.view('S4').tostring()
> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
> >>> b.view('S4')[0]
> b'\xd5'
> >>> b.view('S4')[0].tostring()
> b'\xd5'
> >>> b.view('S4')[:1].tostring()
> b'\xd5\x00\x00\x00'
> >>> b.view('S4')[0].decode('utf-32LE')
> Traceback (most recent call last):
>   File "<pyshell#101>", line 1, in <module>
>     b.view('S4')[0].decode('utf-32LE')
>   File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
>     return codecs.utf_32_le_decode(input, errors, True)
> UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position 0: truncated data
> >>> b.view('S4')[:1].tostring().decode('utf-32LE')
> 'Õ'
>
> numpy arrays need a decode and encode method

I'm not sure that they do. Rather, there needs to be a text dtype that knows what encoding to use in order to have a binary interface, as exposed by .tostring() and friends, but produce unicode strings when indexed from Python code. Having both a text and a binary interface to the same data implies having an encoding.

Oscar
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
Thanks for poking into this all. I've lost track a bit, but I think:

The 'S' type is clearly broken on py3 (at least). I think that gives us room to change it, and backward compatibility is less of an issue because it's broken already -- do we need to preserve bug-for-bug compatibility? Maybe, but I suspect in this case, not -- the code that works fine on py3 with the 'S' type is probably only lucky that it hasn't encountered the issues yet. And no matter how you slice it, code being ported to py3 needs to deal with text handling issues.

But here is where we stand. The 'S' dtype:
- was designed for one-byte-per-char text data
- was mapped to the py2 string type
- used the classic C null-terminated approach
- can be used for arbitrary bytes (as the py2 string type can), but not quite, as it truncates null bytes -- so it's really a bad idea to use it that way.

Under py3, the 'S' type maps to the py3 bytes type, because that's the closest to the py2 string type. But it also does some inconsistent things with encoding, and does treat a lot of other things as text. And the py3 bytes type does not have the same text handling as the py2 string type, so things like:

>>> s = 'a string'
>>> np.array((s,), dtype='S')[0] == s

give you False, rather than True as on py2. This is because a py3 string is translated to the 'S' type (presumably with the default encoding, which is maybe also not a good idea), but indexing returns a bytes object, which does not compare true to a py3 string. You can work around this with various calls to encode() and decode(), and/or by using b'a string', but that is ugly, kludgy, and doesn't work well with the py3 text model.

The py2 to py3 transition separated bytes and strings: strings are unicode, and bytes are not to be used for text (directly). While there is some text-related functionality still in bytes, the core devs are quite clear that that is for special cases only, and not for general text processing.
I don't think numpy should fight this, but rather embrace the py3 text model. The most natural way to do that is to use the existing 'U' dtype for text. It is really the best solution for most cases (like the one above). However, there is a use case for a more efficient way to deal with text. There are a couple of ways to go about that that have been brought up here:

1: Have a more efficient unicode dtype: variable length, multiple encoding options, etc.
   - This is a fine idea that would support better text handling in numpy, and _maybe_ better interaction with external libraries (HDF, etc...).

2: Have a one-byte-per-char text dtype.
   - This would be much easier to implement, would fit into the current numpy model better, and would satisfy a lot of common use cases for scientific data sets.

We could certainly do both, but I'd like to see (2) get done sooner rather than later.

A related issue is whether numpy needs a dtype analogous to py3 bytes -- I'm still not sure of the use case there, so can't comment -- would it need to be fixed length (fitting into the numpy data model better) or variable length, or ??? Some folks are (apparently) using the current 'S' type in this way, but I think that's ripe for errors, due to the null bytes issue. Though maybe there is a null-bytes-are-special binary format that isn't text -- I have no idea.

So what do we do with 'S'? It really is pretty broken, so we have a couple of choices:

(1) Deprecate it, so that it stays around for backward compatibility, but encourage people to either use 'U' for text or one of the new dtypes that are yet to be implemented (maybe 's' for a one-byte-per-char dtype), and to use either uint8 or the new bytes dtype that is yet to be implemented.

(2) Fix it -- in this case, I think we need to be clear what it is:
    -- A one-byte-per-char text type?
If so, it should map to a py3 string, and have a defined encoding (ascii or latin-1, probably), or even better a settable encoding -- but only for one-byte-per-char encodings. I don't think utf-8 is a good idea here, as a utf-8 encoded string is of unknown length. (There is some room for debate here: as the 'S' type is fixed length and truncates anyway, maybe it's fine for it to truncate utf-8, as long as it doesn't partially truncate in the middle of a character.)

    -- A bytes type? In which case, we should clean out all the automatic conversions to and from text that are in it now.

I vote for it being our one-byte text type -- it almost is already, and it would make the easiest transition for folks from py2 to py3. But backward compatibility is backward compatibility.

> > numpy arrays need a decode and encode method
>
> I'm not sure that they do. Rather there needs to be a text dtype that knows what encoding to use in order to have a binary interface as exposed by .tostring() and friends, but produce unicode strings when indexed from Python code. Having both a text and a binary interface to the same data implies having an encoding.

I agree with Oscar here -- let's not conflate encoded and decoded data --
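[Editor's aside: a minimal plain-Python sketch of the settable-encoding, fixed-width text type proposed above. The class name and methods are hypothetical, purely to illustrate the encode-on-write / decode-on-read contract: bytes inside, str outside, error (not truncation) on over-long input:]

```python
class FixedWidthText:
    """Hypothetical fixed-width text field: bytes inside, str outside."""

    def __init__(self, nbytes, encoding='latin-1'):
        self.nbytes = nbytes
        self.encoding = encoding

    def pack(self, text):
        # Encode on write; refuse (rather than truncate) over-long input.
        raw = text.encode(self.encoding)
        if len(raw) > self.nbytes:
            raise ValueError('encoded text exceeds %d bytes' % self.nbytes)
        return raw.ljust(self.nbytes, b'\0')

    def unpack(self, raw):
        # Decode on read and strip the null *characters* used as padding.
        return raw.decode(self.encoding).rstrip('\0')

field = FixedWidthText(8)
packed = field.pack('Õsc')   # b'\xd5sc' + five null padding bytes
print(field.unpack(packed))  # 'Õsc'
```

Because the null stripping happens on the decoded str, not the raw bytes, this works even for encodings where meaningful nulls appear inside the payload.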
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 1:49 PM, Chris Barker chris.bar...@noaa.gov wrote:

s = 'a string'
np.array((s,), dtype='S')[0] == s

gives you False, rather than True as on py2. This is because a py3 string is translated to the 'S' type (presumably with the default encoding -- another thing that is maybe not a good idea), but returns a bytes object, which does not compare true to a py3 string. You can work around this with various calls to encode() and decode(), and/or using b'a string', but that is ugly, kludgy, and doesn't work well with the py3 text model.

I think this is just inconsistent casting rules in numpy. numpy should either refuse to assign the wrong type, instead of using the repr as in some of Oscar's earlier examples:

s = np.inf
np.array((s,), dtype=int)[0] == s
Traceback (most recent call last):
  File pyshell#126, line 1, in module
    np.array((s,), dtype=int)[0] == s
OverflowError: cannot convert float infinity to integer

or use the **same** conversion/casting rules during the interaction with python as are used in assignments and array creation.

Josef
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
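The comparison failure described above is easy to reproduce on py3 (a minimal sketch of the behavior under discussion):

```python
import numpy as np

s = 'a string'
a = np.array((s,), dtype='S')
elem = a[0]
# indexing returns bytes, and bytes == str is always False on py3:
print(elem == s)                   # False
print(elem == b'a string')         # True
# the kludgy workaround: decode explicitly
print(elem.decode('ascii') == s)   # True
```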
Re: [Numpy-discussion] cannot decode 'S'
Josef,

Nice find -- another reason why 'S' can NOT be used as-is for arbitrary bytes. See the other thread for my proposals about that.

messy workaround (arrays, in contrast to scalars, are not truncated in `tostring`):

[a[i:i+1].tostring().decode('utf-16LE') for i in range(len(a))]
['Õsc', 'zxc']

I think the real workaround is to not try to store arbitrary bytes -- i.e. encoded text -- in the 'S' dtype. But is there a convenient way to do it with other existing numpy types? I tried to do it with uint8, and it's really awkward.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959 voice
7600 Sand Point Way NE  (206) 526-6329 fax
Seattle, WA 98115       (206) 526-6317 main reception

chris.bar...@noaa.gov
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 11:18 AM, josef.p...@gmail.com wrote:

I think this is just inconsistent casting rules in numpy. numpy should either refuse to assign the wrong type ... or use the **same** conversion/casting rules during the interaction with python as are used in assignments and array creation.

Exactly -- but what should those conversion/casting rules be? We can't decide that unless we decide whether 'S' is for text or for arbitrary bytes -- it can't be both. I say text; that's what it's mostly trying to do already. But if it's bytes, fine, then some things still need cleaning up, and we could really use a one-byte text type. And if it's text, then we may need a bytes dtype.

Key here is that we don't have the option of not breaking anything, because there is a lot already broken.

-Chris
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
numpy arrays need a decode and encode method

I'm not sure that they do. Rather, there needs to be a text dtype that knows what encoding to use in order to have a binary interface as exposed by .tostring() and friends, but produce unicode strings when indexed from Python code. Having both a text and a binary interface to the same data implies having an encoding.

I agree with Oscar here -- let's not conflate encoded and decoded data -- the py3 text model is a fine one, we should work with it as much as practical. UNLESS: if we do add a bytes dtype, then it would be a reasonable use case to use it to store encoded text (just like the py3 bytes type), in which case it would be good to have encode() and decode() methods or ufuncs -- probably ufuncs. But that should be special-purpose, at-the-I/O-interface kind of stuff.

I think we need both things: changing the memory and changing the view. The same way we can convert between int and float and complex (trunc, astype, real, ...), we should be able to convert between bytes and any string (text) dtype, i.e. decode and encode. Use case: I'm reading a file in binary and then want to convert it to unicode; or I realize I have only ascii and want to convert to something less memory hungry.

Views don't care about what the content means; it just has to be memory compatible. I can view anything as an 'S' or a 'uint' (I think). What we currently don't have is a string/text view on 'S' that would interact with python as a string. (That's a vote in favor of a minimal one-char string dtype that would work for a limited number of encodings.)

Josef
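The distinction drawn above between changing the view and changing the memory can be sketched with existing numpy operations (a quick illustration, not a proposed API):

```python
import numpy as np

a = np.array([b'abc'], dtype='S3')

# a view reinterprets the same memory -- no conversion, no copy:
v = a.view(np.uint8)
print(v.tolist())  # [97, 98, 99]

# astype converts, allocating new memory with a new representation:
f = np.array([1, 2]).astype(np.float64)
print(f.dtype)  # float64
```

A decode/encode pair between 'S' and a text dtype would belong in the second, memory-changing category.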
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 2:45 PM, Chris Barker chris.bar...@noaa.gov wrote:

Exactly -- but what should those conversion/casting rules be? We can't decide that unless we decide whether 'S' is for text or for arbitrary bytes -- it can't be both. I say text; that's what it's mostly trying to do already. But if it's bytes, fine, then some things still need cleaning up, and we could really use a one-byte text type. And if it's text, then we may need a bytes dtype.

(remember I'm just a balcony muppet)

As far as I understand, all codecs have the same ascii part. So I would cast on ascii and raise on anything else. Or follow whatever the convention of numpy is:

s = -256
np.array((s,), dtype=np.uint8)[0] == s
False
s = -1
np.array((s,), dtype=np.uint8)[0] == s
False

Josef
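The numeric convention Josef alludes to -- silent modular wrapping rather than an error -- can be shown explicitly with astype (a sketch; the behavior of direct out-of-range assignment has varied across numpy versions, but astype casting is C-style and stable):

```python
import numpy as np

a = np.array([-256, -1]).astype(np.uint8)
print(a.tolist())      # [0, 255] -- values wrap modulo 256
print(a[0] == -256)    # False: the round trip loses the original value
```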
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 1:36 PM, Oscar Benjamin oscar.j.benja...@gmail.com wrote:

On 23 January 2014 17:42, josef.p...@gmail.com wrote:
On Thu, Jan 23, 2014 at 12:13 PM, josef.p...@gmail.com wrote:
On Thu, Jan 23, 2014 at 11:58 AM, josef.p...@gmail.com wrote:

No, a view doesn't change the memory, it just changes the interpretation and there shouldn't be any conversion involved. astype does type conversion, but it goes through ascii encoding, which fails.

b = np.array(['Õsc', 'zxc'], dtype='U3')
b.tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
b.view('S12')
array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'], dtype='|S12')

The conversion happens somewhere in the array creation, but I have no idea about the memory encoding for ucs2 and the low-level layouts.

b = np.array(['Õsc', 'zxc'], dtype='U3')
b[0].tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
'Õsc'.encode('utf-32LE')
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

Is that the encoding for 'U'?

On a little-endian system, yes. I realise what's happening now. 'U' represents unicode characters as a 32-bit unsigned integer giving the code point of the character. The first 256 code points are exactly the 256 characters representable with latin-1, in the same order. So 'Õ' has the code point 0xd5 and is encoded as the byte 0xd5 in latin-1. As a 32-bit integer the code point is 0x00d5, but in little-endian format that becomes the 4 bytes 0xd5,0x00,0x00,0x00. So when you reinterpret that as 'S4' it strips the trailing nulls to get the byte string b'\xd5'. Which is the latin-1 encoding for the character. The same will happen for any string of latin-1 characters. However, if you do have a code point of 256 or greater then you'll get byte strings of length 2 or more. On a big-endian system I think you'd get b'\x00\x00\x00\xd5'.
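Oscar's explanation of the 'U' layout can be checked directly (a quick sketch, little-endian assumed as in the examples above):

```python
import sys
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='U3')
# 'U' stores one 32-bit code point per character, so on a
# little-endian machine the raw buffer is exactly UTF-32-LE:
if sys.byteorder == 'little':
    print(b.tobytes() == 'Õsczxc'.encode('utf-32-le'))  # True
```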
A curious consequence of this, if we have only 1-character elements:

a = np.array([si.encode('utf-16LE') for si in ['Õ', 'z']], dtype='S')
a32 = np.array([si.encode('utf-32LE') for si in ['Õ', 'z']], dtype='S')
a[0], a32[0]
(b'\xd5', b'\xd5')
a[0] == a32[0]
True
a32 = np.array([si.encode('utf-32BE') for si in ['Õ', 'z']], dtype='S')
a = np.array([si.encode('utf-16BE') for si in ['Õ', 'z']], dtype='S')
a[0], a32[0]
(b'\x00\xd5', b'\x00\x00\x00\xd5')
a[0] == a32[0]
False

Josef

Another side effect of null truncation: truncated data cannot be decoded.

b.view('S4').tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
b.view('S4')[0]
b'\xd5'
b.view('S4')[0].tostring()
b'\xd5'
b.view('S4')[:1].tostring()
b'\xd5\x00\x00\x00'
b.view('S4')[0].decode('utf-32LE')
Traceback (most recent call last):
  File pyshell#101, line 1, in module
    b.view('S4')[0].decode('utf-32LE')
  File C:\Programs\Python33\lib\encodings\utf_32_le.py, line 11, in decode
    return codecs.utf_32_le_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position 0: truncated data
b.view('S4')[:1].tostring().decode('utf-32LE')
'Õ'

numpy arrays need a decode and encode method

I'm not sure that they do. Rather, there needs to be a text dtype that knows what encoding to use in order to have a binary interface as exposed by .tostring() and friends, but produce unicode strings when indexed from Python code. Having both a text and a binary interface to the same data implies having an encoding.

Oscar
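The truncated-data error shown above can be worked around by re-padding to the code-unit size before decoding. A hypothetical helper (not an existing numpy function) sketching the idea:

```python
def decode_padded(b, encoding='utf-32-le', unit=4):
    """Re-pad null-truncated bytes to a multiple of the code-unit
    size, decode, then strip the trailing null characters back off."""
    pad = -len(b) % unit
    s = (b + b'\0' * pad).decode(encoding)
    return s.rstrip('\0')

print(decode_padded(b'\xd5'))           # 'Õ' -- decodes where .decode() raised
print(decode_padded(b'z\x00\x00\x00'))  # 'z'
```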
[Numpy-discussion] Text array dtype for numpy
There have been a few threads discussing the problems of how to do text with numpy arrays in Python 3. To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3. It is intended as an illustration, since I think that the real solution is a new dtype rather than an array subclass (so that it can be used in e.g. record arrays).

The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings.

I believe it would not be as hard to implement as the proposals for variable-length string arrays. The one caveat is that it will strip null characters from the end of any string. I'm not 100% sure that the byte-stripping decoding function will always work, but it does for all the encodings I know, and it seems to work with all the encodings that Python has.

The code is inline below and attached (in case there are encoding problems with this message!):

Oscar

#!/usr/bin/env python3

from numpy import ndarray, array

class textarray(ndarray):
    '''ndarray for holding encoded text.

    This is for demonstration purposes only. The real proposal
    is to specify the encoding as a dtype rather than a subclass.

    Only works as a 1-d array.

    >>> a = textarray(['qwert', 'zxcvb'], encoding='ascii')
    >>> a
    textarray(['qwert', 'zxcvb'],
          dtype='|S5:ascii')
    >>> a[0]
    'qwert'
    >>> a.tostring()
    b'qwertzxcvb'
    >>> a[0] = 'qwe'  # shorter string
    >>> a[0]
    'qwe'
    >>> a.tostring()
    b'qwe\\x00\\x00zxcvb'
    >>> a[0] = 'qwertyuiop'  # longer string
    Traceback (most recent call last):
        ...
    ValueError: Encoded bytes don't fit
    >>> b = textarray(['Õscar', 'qwe'], encoding='utf-8')
    >>> b
    textarray(['Õscar', 'qwe'],
          dtype='|S6:utf-8')
    >>> b[0]
    'Õscar'
    >>> b[0].encode('utf-8')
    b'\\xc3\\x95scar'
    >>> b.tostring()
    b'\\xc3\\x95scarqwe\\x00\\x00\\x00'
    >>> c = textarray(['qwe'], encoding='utf-32-le')
    >>> c
    textarray(['qwe'],
          dtype='|S12:utf-32-le')
    '''
    def __new__(cls, strings, encoding='utf-8'):
        bytestrings = [s.encode(encoding) for s in strings]
        a = array(bytestrings, dtype='S').view(textarray)
        a.encoding = encoding
        return a

    def __repr__(self):
        slist = ', '.join(repr(self[n]) for n in range(len(self)))
        return "textarray([%s],\n      dtype='|S%d:%s')" \
                % (slist, self.itemsize, self.encoding)

    def __getitem__(self, index):
        bstring = ndarray.__getitem__(self, index)
        return self._decode(bstring)

    def __setitem__(self, index, string):
        bstring = string.encode(self.encoding)
        if len(bstring) > self.itemsize:
            raise ValueError("Encoded bytes don't fit")
        ndarray.__setitem__(self, index, bstring)

    def _decode(self, b):
        b = b + b'\0' * (4 - len(b) % 4)
        s = b.decode(self.encoding)
        for n, c in enumerate(reversed(s)):
            if c != '\0':
                return s[:len(s) - n]
        return s

if __name__ == "__main__":
    import doctest
    doctest.testmod()
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 12:10 PM, josef.p...@gmail.com wrote:

(remember I'm just a balcony muppet)

me too ;-)

As far as I understand all codecs have the same ascii part.

nope -- certainly not the multi-byte codecs. And one of the key points of utf-8 is that its ascii part is compatible -- none of the other full-unicode encodings are. Many of the one-byte-per-char ones do share the ascii part, but not all, or not completely.

So I would cast on ascii and raise on anything else.

still a fine option -- clearly defined and quite useful for scientific text. However, I would prefer latin-1 -- that way you might get garbage for the non-ascii parts, but it wouldn't raise an exception, and it round-trips through encoding/decoding. And you would have a somewhat more useful subset -- including the latin-language characters and symbols like the degree symbol, etc.

or follow whatever the convention of numpy is:

s = -256
np.array((s,), dtype=np.uint8)[0] == s
False
s = -1
np.array((s,), dtype=np.uint8)[0] == s
False

I think text is distinct enough from numbers that we don't need to do the same thing -- and this is the result of well-defined casting rules built into the compiler (and hardware?) for the numeric types. I don't think we have either the standard or compiler support for text conversions like that.

-CHB

PS: this is interesting, on py2:

In [176]: a = np.array((,), dtype='S')
In [177]: a
Out[177]: array(['2'], dtype='|S1')

It converts it to a string, but only grabs the first character? (Is it determining the size before converting to a string?)

and this:

In [182]: a = np.array(, dtype='S')
In [183]: a
Out[183]: array('', dtype='|S24')

24? Where did that come from?
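The ascii-raises vs latin-1-round-trips trade-off being debated here comes down to a few lines of standard library behavior:

```python
data = bytes(range(256))

# latin-1 maps every byte 0-255 to a code point, so it round-trips
# arbitrary bytes (possibly displaying as garbage, but losing nothing):
assert data.decode('latin-1').encode('latin-1') == data

# ascii raises on any byte >= 0x80 instead:
try:
    data.decode('ascii')
except UnicodeDecodeError as e:
    print('ascii raised:', e.reason)
```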
Re: [Numpy-discussion] (no subject)
Both scipy and numpy require GSOC candidates to have a pull request accepted as part of the application process. I'd suggest implementing a function not currently in scipy that you think would be useful. That would also help in finding a mentor for the summer. I'd also suggest getting familiar with cython.

Chuck

Thanks a lot for the heads-up. I am yet to familiarize myself with Cython, and it indeed plays a crucial role, especially in the 'special' module.

I don't see you on github yet; are you there? If not, you should set up an account to work in. See the developer guide http://docs.scipy.org/doc/numpy/dev/ for some pointers.

Chuck

I am present on github, but the profile at present is just a mark of the humble mistakes of a beginner to open source. The id is https://github.com/jennystone. I hope to build upon my profile.

Jennifer
Re: [Numpy-discussion] (no subject)
Scipy doesn't have a function for the Laplace transform; it has only a Laplace distribution in scipy.stats and a Laplace filter in scipy.ndimage. An inverse Laplace transform would be very welcome, I'd think -- it has real-world applications, and there's no good implementation in any open source library as far as I can tell. It's probably doable, but not the easiest topic for a GSoC, I think. From what I can find, the paper Numerical Transform Inversion Using Gaussian Quadrature by den Iseger contains what's considered the current state-of-the-art algorithm.

Browsing that gives a reasonable idea of the difficulty of implementing `ilaplace`. A brief scan through the paper does indicate the complexity of the algorithm. But GSoC project or not, can't we work on it, step by step? I would love to see a contender for Matlab's ilaplace on the open source front!

You can have a look at https://github.com/scipy/scipy/pull/2908/files for ideas. Most of the things that need improving, or that we really think we should have in Scipy, are listed there. Possible topics are not restricted to that list though -- it's more important that you pick something you're interested in and have the required background and coding skills for.

Thanks a lot for the roadmap. Of the options provided, I found the Cythonization of cluster great. Would it be possible to do it as the summer project if I spend the month learning Cython?

Regards,
Janani

Cheers,
Ralf
[Numpy-discussion] De Bruijn sequence
I happen to be working with De Bruijn sequences. Is there any interest in this being part of numpy/scipy? https://gist.github.com/vincentdavis/8588879

Vincent Davis
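For reference, the standard recursive construction of a de Bruijn sequence is short. This is the textbook Lyndon-word-based algorithm, not necessarily what the linked gist implements:

```python
def de_bruijn(k, n):
    """Return the de Bruijn sequence B(k, n) over the alphabet
    0..k-1 as a list of symbols (standard recursive construction)."""
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            # emit the Lyndon word if its length divides n
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

print(de_bruijn(2, 3))  # [0, 0, 0, 1, 0, 1, 1, 1] -- i.e. 00010111
```

Every one of the k^n length-n words appears exactly once when the sequence is read cyclically.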
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On 23 January 2014 21:51, Chris Barker chris.bar...@noaa.gov wrote:

However, I would prefer latin-1 -- that way you might get garbage for the non-ascii parts, but it wouldn't raise an exception and it round-trips through encoding/decoding. And you would have a somewhat more useful subset -- including the latin-language character and symbols like the degree symbol, etc.

Exceptions and error messages are a good thing! Garbage is not!!! :)

Oscar
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 4:02 PM, Oscar Benjamin oscar.j.benja...@gmail.com wrote:

Exceptions and error messages are a good thing! Garbage is not!!! :)

In principle, I agree with you, but sometimes practicality beats purity. In py2 there is a lot of implicit encoding/decoding going on, using the system encoding. That is ascii on a lot of systems. The result is that there is a lot of code out there that folks have ported to use unicode but missed a few corners. If that code is only tested with ascii, it all seems to be working, but then out in the wild someone puts another character in there and presto -- a crash.

Also, there are places where the inability to encode silently swallows messages -- for instance, if an Exception is raised with a unicode message, it will get silently dropped when it comes time to display it on the terminal. I spent quite a while banging my head against that one recently when I tried to update some code to read unicode files. I would have been MUCH happier with a bit of garbage in the message than having it dropped (or an encoding error raised in the middle of handling the error...). I think this is a bad thing.

The advantage of latin-1 is that while you might get something that doesn't print right, it won't crash, and it won't contaminate the data, so comparisons, etc. will still work. Kind of like using utf-8 in an old-style C char array -- you can still pass it around and compare it, even if the bytes don't mean what you think they do.

-CHB
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Thu, Jan 23, 2014 at 3:56 PM, josef.p...@gmail.com wrote:

I'm not sure anymore; after all these threads I think bytes should be bytes and strings should be strings.

exactly -- that's the py3 model, and I think we really should try to conform to it; it's really the only way to get a robust solution.

I like the idea of an `encoding_view` on some 'S' bytes, and once we have a view like that there is no reason to pretend 'S' bytes are text.

right -- then they are bytes, not text. Period. I'm not sure we should conflate encoded text and arbitrary bytes, but it does make sense to build encoded text on a bytes object.

maybe I didn't pay attention because I didn't care, until we ran into the python 3 problems. maybe nobody else did either.

yup -- I think this didn't get a whole lot of review or testing.

-Chris
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On 24 January 2014 01:09, Chris Barker chris.bar...@noaa.gov wrote:

in principle, I agree with you, but sometimes practicality beats purity. in py2 there is a lot of implicit encoding/decoding going on, using the system encoding. That is ascii on a lot of systems. The result is that there is a lot of code out there that folks have ported to use unicode but missed a few corners. If that code is only tested with ascii, it all seems to be working, but then out in the wild someone puts another character in there and presto -- a crash.

Precisely. The Py3 text model uses TypeErrors to warn early against this kind of thing. No longer do you have code that seems to work until the wrong character goes in. You get the error straight away when you try to mix bytes and text. You still have the option to silence those errors; it just needs to be done explicitly:

s = 'Õscar'
s.encode('ascii', errors='replace')
b'?scar'

Also, there are places where the inability to encode silently swallows messages -- for instance, if an Exception is raised with a unicode message, it will get silently dropped when it comes time to display it on the terminal. I spent quite a while banging my head against that one recently when I tried to update some code to read unicode files. I would have been MUCH happier with a bit of garbage in the message than having it dropped (or an encoding error raised in the middle of handling the error...)

Yeah, that's just a bug in CPython.
I think it's fixed now, but either way you're right: for the particular case of displaying error messages the interpreter should do whatever it takes to get some kind of error message out, even if it's a bit garbled. I disagree that this should be the basis for ordinary data processing with numpy, though.

I think this is a bad thing. The advantage of latin-1 is that while you might get something that doesn't print right, it won't crash, and it won't contaminate the data, so comparisons, etc. will still work. Kind of like using utf-8 in an old-style C char array -- you can still pass it around and compare it, even if the bytes don't mean what you think they do.

It round-trips okay as long as you don't try to do anything else with the string. So does the textarray class I proposed in a new thread: if you just use fromfile and tofile, it works fine for any input (except for trailing nulls), but if you try to decode invalid bytes it will throw errors. It wouldn't be hard to add configurable error handling there either.

Oscar
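The middle ground between raising and silent garbage that this thread keeps circling is already available through Python's codec error handlers (standard library behavior, shown here for the example string used above):

```python
s = 'Õscar'

print(s.encode('ascii', errors='replace'))   # b'?scar'  -- explicit, lossy
print(s.encode('ascii', errors='ignore'))    # b'scar'   -- explicit, lossy
print(s.encode('latin-1'))                   # b'\xd5scar'
print(s.encode('latin-1').decode('latin-1') == s)  # True -- lossless round trip
```

A configurable-errors text dtype could expose exactly this choice instead of hard-coding either strictness or latin-1 permissiveness.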