[Numpy-discussion] One-byte string dtype: third time's the charm?
The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].

tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?

A key consequence of not having a one-byte string dtype is that handling ASCII data stored in binary formats such as HDF or FITS is basically broken in Python 3. Packages like h5py, pytables, and astropy.io.fits all return text data arrays with the numpy 'S' type, and in fact have no direct support for the numpy wide unicode 'U' type. In Python 3, an 'S' type array cannot be compared with the Python str type, so something like the following fails:

    mask = (names_array == "john")  # FAIL

Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.

For a good top-level summary of much of the previous thread discussion, see [5] from Chris Barker. Condensing this down to just a few points:

- *Changing* the behavior of the existing 'S' type is going to break code and seems a bad idea.
- *Adding* a new dtype 's' will work and allow highly performant conversion from 'S' to 's' via view().
- Using the latin-1 encoding will minimize code breakage vis-a-vis what works in Python 2 [6].

Using latin-1 is a pragmatic compromise that provides continuity to allow scientists to run their existing code in Python 3 and have things just work. It isn't perfect and it should not be the end of the story, but it would be good. This single issue is the *only* thing blocking me and my team from using Python 3 in operations.

As a final point, I don't know the numpy internals at all, but it *seems* like this proposal is one of the easiest to implement amongst those that were discussed.

Cheers,
Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
[2]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
[3]: https://github.com/astropy/astropy/issues/3311
[4]: http://astropy.readthedocs.org/en/latest/api/astropy.table.Table.html#astropy.table.Table.convert_bytestring_to_unicode
[5]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070631.html
[6]: It is not uncommon to store uint8 data in a bytestring

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
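For readers who want to reproduce the problem described in this message, here is a minimal sketch. The array contents are made up for illustration, and the exact return value of the failing comparison depends on the NumPy version (a scalar False in older releases, an all-False array in newer ones, as far as I can tell), so the sketch only checks that nothing matches.

```python
import warnings
import numpy as np

# A bytes-typed ('S') array, similar to what an HDF or FITS reader hands back.
# The names are illustrative, not taken from any real file.
names_array = np.array([b"john", b"mary", b"john"])

with warnings.catch_warnings():
    # Some NumPy versions emit a FutureWarning for this comparison.
    warnings.simplefilter("ignore")
    str_mask = names_array == "john"

# Whatever shape the result has, no element compares equal to the str.
no_match = not np.any(str_mask)

# Comparing against bytes works, but pushes b"..." literals onto every user.
bytes_mask = names_array == b"john"

# Converting to 'U' makes str comparison work, at 4 bytes per character.
as_unicode = names_array.astype("U")
unicode_mask = as_unicode == "john"

print(no_match, bytes_mask, unicode_mask)
print(names_array.itemsize, as_unicode.itemsize)
```

The last line shows the memory cost mentioned above: the 'U' copy needs four times the bytes per element of the 'S' original.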
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On 22/02/15 19:21, Aldcroft, Thomas wrote:
> Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.

Why UCS-4? Python's internal flexible string representation will use ASCII for ASCII text. By PEP 393 an application should not assume an internal string representation at all:

https://www.python.org/dev/peps/pep-0393/

If the problem is a PEP 393 violation in the NumPy string or unicode dtype, we shouldn't violate it even further by adding a latin-1 encoded ASCII string. We should let Python represent strings as it wants, and it will not bloat.

I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393 compliant if it is not. And on Python 3 'U' and 'S' should just be synonyms.

You can also store an array of bytes with uint8. Then you can decode it however you like to make a Python string. If it is encoded as latin-1 then decode it as latin-1:

    In [1]: import numpy as np

    In [2]: ascii_bytestr = "The quick brown fox jumps over the lazy dog".encode('latin-1')

    In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))

    In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
    Out[4]: (dtype('uint8'), (43,))

    In [5]: bytes(numpy_bytestr).decode('latin-1')
    Out[5]: 'The quick brown fox jumps over the lazy dog'

Sturla
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 11:29 AM, Sturla Molden sturla.mol...@gmail.com wrote:
> On 22/02/15 19:21, Aldcroft, Thomas wrote:
>> Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.
>
> Why UCS-4? Python's internal flexible string representation will use ASCII for ASCII text.

This is a discussion about how strings are represented as bit-patterns inside ndarrays; the internal storage representation used by 'str' is irrelevant.

-n

--
Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 7:29 PM, Sturla Molden sturla.mol...@gmail.com wrote:
> On 22/02/15 19:21, Aldcroft, Thomas wrote:
>> Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.
>
> Why UCS-4? Python's internal flexible string representation will use ASCII for ASCII text.

numpy's 'U' dtype is UCS-4, and this is what Thomas is referring to, not Python's string type. It cannot have a flexible representation as it *is* the representation. Python 3's `str` type is opaque, so it can freely choose how to represent the data in memory. numpy dtypes transparently describe how the data is represented in memory.

--
Robert Kern
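The point that a dtype *is* the memory layout can be checked directly. The following is a small sketch with made-up two-character strings; the explicit little-endian `<U2` is my choice so that the byte comparison behaves the same on any platform.

```python
import numpy as np

# Explicit little-endian 'U' so the raw bytes are platform independent.
wide = np.array(["ab", "cd"], dtype="<U2")
narrow = np.array([b"ab", b"cd"], dtype="S2")

# 'U' reserves 4 bytes per character, 'S' reserves 1.
print(wide.dtype.itemsize, narrow.dtype.itemsize)

# The buffer of a 'U' array is UTF-32 (UCS-4) text, element after element.
raw = wide.tobytes()
print(raw == "abcd".encode("utf-32-le"))

# The buffer of an 'S' array is just the bytes themselves.
print(narrow.tobytes() == b"abcd")
```

Nothing here depends on how the Python `str` objects that went into the array were stored internally; only the dtype decides what ends up in the buffer.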
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On 22/02/15 21:04, Robert Kern wrote:
> Python 3's `str` type is opaque, so it can freely choose how to represent the data in memory. numpy dtypes transparently describe how the data is represented in memory.

Hm, yes, that is a good point.

Sturla
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 2:46 PM, Charles R Harris charlesr.har...@gmail.com wrote:
> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>>
>>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>>
>>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>>
>>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>>
>> I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
>>
>> - Tom
>
> The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.

float16 and rational probably aren't too relevant because they are fixed-size types, and variable-size dtypes are much trickier. datetime64 will be more similar, but also adds its own irrelevant complexities -- you might be best off just looking at how S and U work and copying them.

-n

--
Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
> On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
>>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>>>
>>>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>>>
>>>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>>>
>>>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>>>
>>> I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
>>>
>>> - Tom
>>
>> The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.
>
> OK I'll have a look at those.

On second thought.. Maybe I'm being naive, but I think that starting from scratch looking at entirely new dtypes is harder than it needs to be, or at least not the most straightforward path [EDIT: just saw email from Nathan agreeing here]. What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'. In other words 's' will interface with Python as a bytes (aka str) object just like 'S'.
- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in all operations, but interfaces with Python as a latin-1 encoded string.

So the only difference is at the interface layer with Python (initialization, comparison, iteration, etc). So as a starting point we would want to clone 'S' to 's', then fix up the interface to Python 3. Does that sound about right?

- Tom

> Thanks,
> Tom
>
>> Chuck
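To make the "interface layer" idea concrete, here is a rough sketch of what the Python 3 side of the proposed 's' type would have to do, emulated on top of the existing 'S' dtype. The helper names `latin1_equal` and `latin1_items` are hypothetical and exist only for this illustration; nothing like them is part of numpy.

```python
import numpy as np


def latin1_equal(byte_array, text):
    """Hypothetical helper: compare an 'S' array with a Python 3 str.

    The str is encoded once to latin-1; the array is neither copied nor
    converted. A str with characters outside latin-1 raises UnicodeEncodeError.
    """
    return byte_array == text.encode("latin-1")


def latin1_items(byte_array):
    """Hypothetical helper: iterate over the elements as Python str objects."""
    for item in byte_array.ravel():
        yield item.decode("latin-1")


# Stored data stays one byte per character, exactly as in 'S'.
names = np.array([b"john", b"caf\xe9"])

mask = latin1_equal(names, "caf\u00e9")   # "café"
items = list(latin1_items(names))

print(mask)
print(items)
```

Storage and all internal operations stay as they are for 'S'; only comparison and iteration translate between bytes and str, which is the split the message describes.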
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Feb 22, 2015 3:39 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
> On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>> On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>>> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
>>>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>>>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>>>>
>>>>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>>>>
>>>>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>>>>
>>>>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>>>>
>>>> I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
>>>>
>>>> - Tom
>>>
>>> The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.
>>
>> OK I'll have a look at those.
>
> On second thought.. Maybe I'm being naive, but I think that starting from scratch looking at entirely new dtypes is harder than it needs to be, or at least not the most straightforward path [EDIT: just saw email from Nathan agreeing here]. What is being proposed is essentially:
>
> - For Python 2, the 's' type is exactly a clone of 'S'. In other words 's' will interface with Python as a bytes (aka str) object just like 'S'.
> - For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in all operations, but interfaces with Python as a latin-1 encoded string.
>
> So the only difference is at the interface layer with Python (initialization, comparison, iteration, etc). So as a starting point we would want to clone 'S' to 's', then fix up the interface to Python 3. Does that sound about right?

Sounds reasonable to me. You'll also want to consider interactions between the dtypes -- mixed operations like array(a, dtype="s") == array(a, dtype="U") should do the right thing, and casting s->U ditto.

-n
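As a rough picture of what "casting s->U" and mixed comparisons would need to do, the sketch below performs both by hand with the existing `np.char.decode` and `np.char.encode` functions. It is only an approximation of the behaviour under discussion, assuming latin-1 as the fixed encoding; the proposed 's' dtype itself does not exist, and the sample strings are invented.

```python
import numpy as np

# latin-1 encoded bytes held in an ordinary 'S' array.
raw = np.array([b"caf\xe9", b"john"])

# Cast to 'U' with an explicit encoding instead of relying on a default.
text = np.char.decode(raw, "latin-1")

# A mixed comparison done by hand: bring both operands to 'U' first.
other = np.array(["caf\u00e9", "mary"])
mask = text == other

# Going back to one byte per character is lossless for latin-1 text.
back = np.char.encode(text, "latin-1")

print(text.dtype.kind, mask, np.array_equal(back, raw))
```

A real implementation would presumably do the same conversions inside the casting and comparison machinery rather than through explicit calls like these.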
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>
>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>
> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>
> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.

I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).

- Tom

> -n
>
> --
> Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>
>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>
>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>
>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>
> I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
>
> - Tom

The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.

Chuck
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 2:42 PM, Charles R Harris charlesr.har...@gmail.com wrote:
> On Sun, Feb 22, 2015 at 12:52 PM, Nathaniel Smith n...@pobox.com wrote:
>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>
>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>
>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>
>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>
> We're also running out of letters for types. We need to decide on how to extend that representation. It would seem straightforward to just start using multiple letters, but there is a lot of code that uses things like `for dt in 'efdg':`. Can we perhaps introduce an extended dtype structure, maybe with some ideas from dynd and versioning?

I don't mind using 's' for this particular case, but in general I think we should de-emphasise the string representations, and even allow new dtypes to forgo them entirely. We have all of Python to work with. It's much nicer for users and for us to write things like

    dtype=np.someclass(special_option=True)

instead of

    dtype="SC[S_O=T]"

or whatever weird ad-hoc syntax we come up with. (Obviously there are some details to work out with things like the .npy format, but these seem solvable.)

-n

--
Nathaniel J. Smith -- http://vorpus.org
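For context on the one-letter codes being discussed, this short sketch lists what the `'efdg'` loop quoted above iterates over, and shows that the same dtypes can already be requested through Python objects instead of strings. The `np.someclass(special_option=True)` spelling in the message is a hypothetical future API and is not used here.

```python
import numpy as np

# The one-letter type codes that code such as `for dt in 'efdg':` walks over.
for code in "efdg":
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)

# The same dtypes are reachable through Python objects rather than letters.
same_half = np.dtype("e") == np.dtype(np.float16)
same_single = np.dtype("f") == np.dtype(np.float32)
same_double = np.dtype("d") == np.dtype(np.float64)
same_long = np.dtype("g") == np.dtype(np.longdouble)

print(same_half, same_single, same_double, same_long)
```

The name and size printed for `'g'` depend on the platform, which is one reason a single letter says less than it appears to.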
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>
> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?

I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.

The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.

-n

--
Nathaniel J. Smith -- http://vorpus.org
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On 22/02/15 20:57, Nathaniel Smith wrote:
> This is a discussion about how strings are represented as bit-patterns inside ndarrays; the internal storage representation used by 'str' is irrelevant.

I thought it would be clever to just use the same internal representation as Python would choose. But obviously it is not. UTF-8 would fail because it is not stored at a regular (fixed) width per character. And every string in an ndarray will need to have the same encoding, but Python might think otherwise.

Sturla
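The fixed-width problem with UTF-8 can be seen in a few lines. This is only an illustration with two invented words; it uses an existing `S4` slot to stand in for a fixed-size string element.

```python
import numpy as np

words = ["john", "caf\u00e9"]   # the second word is "café"

# latin-1 always uses one byte per character; UTF-8 uses one or more.
latin1_lengths = [len(w.encode("latin-1")) for w in words]
utf8_lengths = [len(w.encode("utf-8")) for w in words]
print(latin1_lengths, utf8_lengths)

# A 4-byte slot holds any 4-character latin-1 string...
ok = np.array([words[1].encode("latin-1")], dtype="S4")[0]

# ...but the UTF-8 form of the same text needs 5 bytes and gets cut
# in the middle of a character, leaving bytes that no longer decode.
cut = np.array([words[1].encode("utf-8")], dtype="S4")[0]
print(ok, cut)
```

With a one-byte encoding the number of characters an element can hold is known from the itemsize alone, which is what makes it fit the existing fixed-size string machinery.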
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 12:52 PM, Nathaniel Smith n...@pobox.com wrote:
> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>
>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>
> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>
> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.

We're also running out of letters for types. We need to decide on how to extend that representation. It would seem straightforward to just start using multiple letters, but there is a lot of code that uses things like `for dt in 'efdg':`. Can we perhaps introduce an extended dtype structure, maybe with some ideas from dynd and versioning?

Chuck
Re: [Numpy-discussion] One-byte string dtype: third time's the charm?
On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris charlesr.har...@gmail.com wrote:
> On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>> On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith n...@pobox.com wrote:
>>> On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:
>>>> The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
>>>>
>>>> tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
>>>
>>> I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.
>>>
>>> The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since doing it directly, without first refactoring how dtypes work, means adding lots of copy-paste code to all the different ufuncs.
>>
>> I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the annoying copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
>>
>> - Tom
>
> The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.

OK I'll have a look at those.

Thanks,
Tom

> Chuck
[Numpy-discussion] np.nonzero behavior with multidimensional arrays
This was raised in SO today:

http://stackoverflow.com/questions/28663142/why-is-np-wheres-result-read-only-for-multi-dimensional-arrays/28664009

np.nonzero (and np.where for boolean arrays) behave differently for 1-D and higher dimensional arrays.

In the first case, a tuple with a single well-behaved base ndarray is returned:

    >>> a = np.ma.array(range(6))
    >>> np.where(a > 3)
    (array([4, 5]),)
    >>> np.where(a > 3)[0].flags
      C_CONTIGUOUS : True
      F_CONTIGUOUS : True
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      UPDATEIFCOPY : False

In the second, a tuple with as many arrays as dimensions in the passed array is returned, but the arrays are not base ndarrays; they are of the same subtype as was passed to the function. These arrays are also set as non-writeable:

    >>> np.where(a.reshape(2, 3) > 3)
    (masked_array(data = [1 1],
                  mask = False,
            fill_value = 99)
    , masked_array(data = [1 2],
                  mask = False,
            fill_value = 99)
    )
    >>> np.where(a.reshape(2, 3) > 3)[0].flags
      C_CONTIGUOUS : False
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : False
      ALIGNED : True
      UPDATEIFCOPY : False

I can't think of any reason that justifies this difference, and believe they should be made to return similar results. My feeling is that the proper behavior is the 1-D one, and that the behavior for multidimensional arrays should match it. Can anyone think of a reason that justifies the current behavior?

Jaime

--
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his plans for world domination.
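Until the two cases agree, one way to get the 1-D style of result regardless of dimensionality is to copy each index array into a plain ndarray. This is a workaround sketch of my own, not something proposed in the message, and what `np.nonzero` itself returns here may differ between NumPy versions.

```python
import numpy as np

a = np.ma.array(range(6)).reshape(2, 3)

# np.array() copies by default and drops the subclass, so each index array
# becomes a base ndarray that owns its data and can be written to.
rows, cols = (np.array(idx) for idx in np.nonzero(a > 3))

print(rows, cols)
print(type(rows) is np.ndarray, rows.flags.writeable, rows.flags.owndata)
```

The copy costs one extra allocation per dimension, which is usually negligible next to the boolean scan itself.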