Re: [Numpy-discussion] A one-byte string dtype?
On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
> On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith n...@pobox.com wrote:
>> On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris charlesr.har...@gmail.com wrote:
>>> I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
>>
>> If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype=O for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
>
> The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.

This wouldn't necessarily help for the gigarows of short text strings use case (depending on what "short" means). Also, even if it technically saves memory, you may have a greater overhead from fragmenting your array all over the heap.

On my 64-bit Linux system the size of a Python 3.3 str containing only ASCII characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory saving over dtype='U' only if the strings are 17 characters or more. To get a 50% saving over dtype='U' you'd need strings of at least 49 characters.

If the NumPy array managed the buffers itself then that per-string memory overhead would be eliminated in exchange for an 8-byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough - null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save memory if the strings are more than 3 characters long, and you get at least a 50% saving for strings longer than 9 characters.

Using utf-8 in the buffers eliminates the need to go around checking maximum code points etc., so I would guess that would be simpler to implement (CPython has now had to triple all of its code paths that actually access the string buffer).

Oscar
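(The break-even figures above are easy to check; a minimal sketch, assuming CPython 3.3+ and NumPy on a 64-bit build, where the exact constants may vary:)

    import sys
    import numpy as np

    # Compare per-element cost of Python str objects (roughly 49 + N bytes
    # for ASCII-only text under PEP 393) with the 4N bytes of dtype='U':
    for n in (1, 16, 17, 49):
        s = 'x' * n
        print(n, sys.getsizeof(s), np.array([s]).itemsize)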
Re: [Numpy-discussion] A one-byte string dtype?
On 21 Jan 2014 11:13, Oscar Benjamin oscar.j.benja...@gmail.com wrote:
> If the NumPy array managed the buffers itself then that per-string memory overhead would be eliminated in exchange for an 8-byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough - null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save memory if the strings are more than 3 characters long, and you get at least a 50% saving for strings longer than 9 characters.

There are various optimisations possible as well. For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely. (Alignment rules mean that pointers to allocated buffers always have the low bits zero; so you can make a rule that if the low bit is set to one, then this means the pointer itself should be interpreted as containing the string data; use the spare bit in the other bytes to encode the length.)

In some cases it may also make sense to let identical strings share buffers, though this adds some overhead for reference counting and interning.

-n
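(A toy Python sketch of the tagged-pointer idea, simplified to 7 characters with an explicit 3-bit length field rather than the 8-character spare-bit scheme Nathaniel describes; all names here are illustrative:)

    TAG = 0x1  # low bit set => the 64-bit word holds inline string data

    def pack_short(s):
        data = s.encode('ascii')
        assert len(data) <= 7
        # bit 0: tag; bits 1-3: length; bytes 1-7: the characters
        word = TAG | (len(data) << 1)
        for i, b in enumerate(data):
            word |= b << (8 * (i + 1))
        return word

    def unpack_short(word):
        assert word & TAG  # a real implementation would follow the pointer otherwise
        n = (word >> 1) & 0x7
        return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(n)).decode('ascii')

    assert unpack_short(pack_short('CGA')) == 'CGA'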
Re: [Numpy-discussion] Creating an ndarray from an iterable over sequences
Hi,

On Tue, Jan 21, 2014 at 8:34 AM, Dr. Leo fhaxbo...@googlemail.com wrote:
> Hi,
>
> I would like to write something like:
>
>     In [25]: iterable=((i, i**2) for i in range(10))
>     In [26]: a=np.fromiter(iterable, int32)
>     ---------------------------------------------------------------------------
>     ValueError                                Traceback (most recent call last)
>     <ipython-input-26-5bcc2e94dbca> in <module>()
>     ----> 1 a=np.fromiter(iterable, int32)
>     ValueError: setting an array element with a sequence.
>
> Is there an efficient way to do this?

Perhaps you could just utilize structured arrays (http://docs.scipy.org/doc/numpy/user/basics.rec.html), like:

    iterable = ((i, i**2) for i in range(10))
    a = np.fromiter(iterable, [('a', int32), ('b', int32)], 10)
    a.view(int32).reshape(-1, 2)
    Out[]:
    array([[ 0,  0],
           [ 1,  1],
           [ 2,  4],
           [ 3,  9],
           [ 4, 16],
           [ 5, 25],
           [ 6, 36],
           [ 7, 49],
           [ 8, 64],
           [ 9, 81]])

My 2 cents,
-eat

> Creating two 1-dimensional arrays first is costly as one has to iterate twice over the data. So the only way I see is creating an empty [10,2] array and filling it row by row. This is memory-efficient but slow. List comprehension is vice versa.
>
> If there is no solution, wouldn't it be possible to rewrite fromiter so as to accept sequences?
>
> Leo
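(For reference, a self-contained version of the structured-array approach above; a sketch assuming NumPy is imported as np:)

    import numpy as np

    # fromiter consumes the generator in one pass, filling one field per
    # tuple element; the view/reshape then exposes an ordinary 2-D array.
    iterable = ((i, i**2) for i in range(10))
    a = np.fromiter(iterable, dtype=[('a', np.int32), ('b', np.int32)], count=10)
    result = a.view(np.int32).reshape(-1, 2)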
Re: [Numpy-discussion] Creating an ndarray from an iterable over sequences
On Tue, Jan 21, 2014 at 07:34:19AM +0100, Dr. Leo wrote:
> Hi,
>
> I would like to write something like:
>
>     In [25]: iterable=((i, i**2) for i in range(10))
>     In [26]: a=np.fromiter(iterable, int32)
>     ValueError: setting an array element with a sequence.
>
> Is there an efficient way to do this? Creating two 1-dimensional arrays first is costly as one has to iterate twice over the data. So the only way I see is creating an empty [10,2] array and filling it row by row. This is memory-efficient but slow. List comprehension is vice versa.

You could use itertools:

    >>> from itertools import chain
    >>> g = ((i, i**2) for i in range(10))
    >>> import numpy
    >>> numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2)
    array([[ 0,  0],
           [ 1,  1],
           [ 2,  4],
           [ 3,  9],
           [ 4, 16],
           [ 5, 25],
           [ 6, 36],
           [ 7, 49],
           [ 8, 64],
           [ 9, 81]], dtype=int32)

Oscar
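(If the length is known in advance, fromiter's count argument lets it preallocate instead of growing the array; a small variation on the above:)

    import numpy as np
    from itertools import chain

    g = ((i, i**2) for i in range(10))
    # 10 pairs flatten to 20 scalars, so count=20:
    a = np.fromiter(chain.from_iterable(g), np.int32, count=20).reshape(-1, 2)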
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote:
> There are various optimisations possible as well. For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely.

snip

Would this new dtype have an opaque memory representation? What would happen in the following:

    >>> a = numpy.array(['CGA', 'GAT'], dtype='s')
    >>> memoryview(a)

    >>> with open('file', 'wb') as fout:
    ...     a.tofile(fout)
    >>> with open('file', 'rb') as fin:
    ...     a = numpy.fromfile(fin, dtype='s')

Should there be a different function for creating such an array from reading a text file? Or would you just need to use fromiter:

    >>> with open('file', encoding='utf-8') as fin:
    ...     a = numpy.fromiter(fin, dtype='s')

    >>> with open('file', 'w', encoding='utf-8') as fout:
    ...     fout.writelines(line + '\n' for line in a)

(Note that the above would not be reversible if the strings contain newlines.)

I think it would be less confusing to use dtype='u' than dtype='U' in order to signify that it is an optimised form of the 'U' dtype as far as access from Python code is concerned. Calling it 's' only really makes sense if there is a plan to deprecate dtype='S'. How would it behave in Python 2? Would it return unicode strings there as well?

Oscar
Re: [Numpy-discussion] A one-byte string dtype?
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris charlesr.har...@gmail.com wrote:

snip

> The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.

Is this approach intended to be in *addition to* the latin-1 's' type originally proposed by Chris, or *instead of* that?

- Tom
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:

snip

> Is this approach intended to be in *addition to* the latin-1 's' type originally proposed by Chris, or *instead of* that?

Well, that's open for discussion.

The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.

Chuck
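(To make the latin-1 trade-off concrete, a quick plain-Python sketch -- not a proposed API -- of what such a type could and could not hold:)

    # latin-1 covers exactly code points U+0000..U+00FF, one byte each:
    b = u'\xd5scar'.encode('latin-1')          # b'\xd5scar' -- 5 bytes, 5 chars
    assert b.decode('latin-1') == u'\xd5scar'  # lossless round-trip

    # Anything outside that range has no latin-1 representation:
    try:
        u'\u65e5\u672c'.encode('latin-1')
    except UnicodeEncodeError:
        pass  # a latin-1 dtype would have to raise here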
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris charlesr.har...@gmail.com wrote:

snip

> Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8).

snip

Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use-case (giga-rows of short fixed-length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits). I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3.
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 06:55:29AM -0700, Charles R Harris wrote:
> Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4

On Python 2, unicode strings can operate transparently with byte strings:

    $ python
    Python 2.7.3 (default, Sep 26 2013, 20:03:06)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> a = np.array([u'\xd5scar'], dtype='U')
    >>> a
    array([u'\xd5scar'], dtype='<U5')
    >>> a[0]
    u'\xd5scar'
    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'
    >>> print(a[0])  # Encodes as UTF-8
    Õscar
    >>> 'My name is %s' % a[0]  # Decodes as ASCII
    u'My name is \xd5scar'
    >>> print('My name is %s' % a[0])  # Encodes as UTF-8
    My name is Õscar

This is no better or worse than the rest of the Py2 text model. So if the new dtype always returns a unicode string under Py2 it should work (as well as the Py2 text model ever does).

> and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.

What do you mean by this? PEP 393 uses UCS-1/2/4, not utf-8/16/32, i.e. it always uses a fixed-width encoding. You can just use the CPython C-API to create the unicode strings. The simplest way is probably to use utf-8 internally and then call PyUnicode_DecodeUTF8 and PyUnicode_EncodeUTF8 at the boundaries. This should work fine on Python 2.x and 3.x. It obviates any need to think about pre-3.3 narrow and wide builds and post-3.3 FSR formats.

Unlike Python's str, there isn't much need to be able to efficiently slice or index within the string array element. Indexing into the array to get the string requires creating a new object, so you may as well just decode from utf-8 at that point [it's big-O(num chars) either way]. There's no need to constrain it to fixed-width encodings like the FSR, in which case utf-8 is clearly the best choice as:

1) It covers the whole unicode spectrum.
2) It uses 1 byte-per-char for ASCII.
3) UTF-8 is a big optimisation target for CPython (so it's fast).

Oscar
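(A rough Python model of that boundary approach -- utf-8 bytes stored internally, decode/encode only on element access; the helper names are illustrative, and a real dtype would do this in C via the calls named above:)

    import numpy as np

    # Internal storage: raw utf-8 bytes in a fixed-width 'S' buffer.
    buf = np.array([u'\xd5scar'.encode('utf-8')], dtype='S16')

    def getitem(arr, i):
        return arr[i].decode('utf-8')   # bytes -> unicode at the boundary

    def setitem(arr, i, text):
        data = text.encode('utf-8')     # unicode -> bytes at the boundary
        if len(data) > arr.dtype.itemsize:
            raise ValueError('encoded string exceeds itemsize')
        arr[i] = data

    assert getitem(buf, 0) == u'\xd5scar'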
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas aldcr...@head.cfa.harvard.edu wrote:

snip

> Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability.

snip
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:

snip
Re: [Numpy-discussion] (no subject)
> What are your interests and experience? If you use numpy, are there things you would like to fix, or enhancements you would like to see?
>
> Chuck

I am an undergraduate student with CS as major and have an interest in Math and Physics. This has led me to use NumPy and SciPy to work on innumerable cases involving special polynomial functions and polynomials like Legendre polynomials, Bessel functions and so on. So the packages are known to me from this point of view. I have a *few proposals* in mind. But I don't have any idea if they are acceptable within the scope of GSoC.

1. Many special functions and polynomials are included in neither NumPy nor SciPy. These include Ellipsoidal Harmonic Functions (Lamé's function) and the Cylindrical Harmonic function. SciPy at present supports only the spherical harmonic function.

Further, why can't we extend SciPy to incorporate *Inverse Laplace Transforms*? At present Matlab has this amazing function *ilaplace* and SymPy does have *inverse_laplace_transform*, but it would be better to incorporate all in one package. I mean, SciPy does have a function to evaluate the Laplace transform.

After having written this, I feel that this post should have been sent to SciPy, but as a majority of contributors are the same I proceed. Please suggest any other possible projects, as I would like to continue with SciPy or NumPy, preferably NumPy as I have been fiddling with its source code for a month now and so am pretty comfortable with it.

As for my experience, I have known C for the past 4 years and have been a Python lover for the past year. I am pretty new to open source communities; I started a month and a half ago.

regards
Jennifer
Re: [Numpy-discussion] (no subject)
On Tue, Jan 21, 2014 at 9:26 AM, jennifer stone jenny.stone...@gmail.com wrote:

snip

It does sound like scipy might be a better match; I don't think anyone would complain if you cross-posted. Both scipy and numpy require GSoC candidates to have a pull request accepted as part of the application process. I'd suggest implementing a function not currently in scipy that you think would be useful. That would also help in finding a mentor for the summer. I'd also suggest getting familiar with cython.

Chuck
Re: [Numpy-discussion] (no subject)
On Tue, 21 Jan 2014 21:56:17 +0530, jennifer stone wrote:
> 1. Many special functions and polynomials are included in neither NumPy nor SciPy. These include Ellipsoidal Harmonic Functions (Lamé's function) and the Cylindrical Harmonic function. SciPy at present supports only the spherical harmonic function.

SciPy's spherical harmonics are very inefficient if one is only interested in computing one specific order. I'd be so happy if someone would work on that!

Stéfan
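(For context, the current single-point interface; a sketch assuming scipy.special's (m, n, theta, phi) argument convention:)

    from scipy.special import sph_harm

    # Evaluate Y_n^m for one (m, n) pair at one point; as noted above,
    # the implementation behind this is not efficient for a single order.
    y = sph_harm(1, 2, 0.5, 0.3)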
Re: [Numpy-discussion] (no subject)
On Tue, Jan 21, 2014 at 9:46 AM, Charles R Harris charlesr.har...@gmail.com wrote:

snip

I don't see you on github yet; are you there? If not, you should set up an account to work in. See the developer guide (http://docs.scipy.org/doc/numpy/dev/) for some pointers.

Chuck
Re: [Numpy-discussion] A one-byte string dtype?
Am I the only one who feels that this (very important -- I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

DG
Re: [Numpy-discussion] A one-byte string dtype?
On 21 Jan 2014 17:28, David Goldsmith d.l.goldsm...@gmail.com wrote:
> Am I the only one who feels that this (very important -- I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith d.l.goldsm...@gmail.com wrote:
> Am I the only one who feels that this (very important -- I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

Or maybe a NEP? https://github.com/numpy/numpy/tree/master/doc/neps

Sorry -- really swamped this week, so I won't be writing it...

-Chris
Re: [Numpy-discussion] A one-byte string dtype?
On 21 Jan 2014 17:35, Nathaniel Smith n...@pobox.com wrote:
> On 21 Jan 2014 17:28, David Goldsmith d.l.goldsm...@gmail.com wrote:
>> Am I the only one who feels that this (very important -- I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?
>
> Sounds plausible, perhaps you could write up such a page?
>
> -n

I can certainly get one started (but I don't think I can faithfully summarize all this thread's current content, so I apologize in advance for leaving that undone).

DG
Re: [Numpy-discussion] A one-byte string dtype?
A lot of good discussion here -- too much to comment on individually, but it seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text for scientific computing. It is analogous to a lower-precision numeric type -- i.e. it could not store arbitrary unicode strings, only the subset that is compatible with the suggested encoding.

Suggested encoding: latin-1

Other options:
- ascii only.
- settable to any one-byte-per-char encoding supported by python. I like this IFF it's pretty easy, but it may add significant complications (and overhead) for comparisons, etc.

NOTE: This is NOT a way to conflate bytes and text, and not a way to go back to the py2 mojibake hell -- the goal here is to very clearly have this be text data, and have a clearly defined encoding. Which is why we can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way to conveniently and efficiently use numpy for text that is ansi-compatible.

2) a utf-8 dtype:

NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte-per-char encoding, so would not fit snugly into the numpy data model. It would give compact memory use for mostly-ascii data, so that would be nice.

3) a fully python-3-like (PEP 393) flexible unicode dtype:

This would get us the advantages of the new py3 unicode model -- compact and efficient when it can be, but also supporting all of unicode. Honestly, this seems like more work than it's worth to me, at least given the current numpy dtype model -- maybe a nice addition to dynd. You can, after all, simply use an object array with py3 strings in it. Though perhaps using the py3 unicode type, but having a dtype that specifically links to that, rather than a generic python object, would be a good compromise.

Hmm -- I guess despite what I said, I just wrote the starting point for a NEP... (or two, actually...)

-Chris
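(Until something like proposal (1) exists, it can be loosely emulated in user code -- compact one-byte 'S' storage, decoded as latin-1 text only at the edges; a sketch, not the proposed interface:)

    import numpy as np

    # One byte per character in memory, but unambiguously text with a
    # declared encoding at the boundary:
    raw = np.array([u'\xd5scar'.encode('latin-1'), b'Chris'], dtype='S5')
    texts = [b.decode('latin-1') for b in raw]   # [u'\xd5scar', u'Chris']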
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker chris.bar...@noaa.gov wrote:
> A lot of good discussion here -- too much to comment on individually, but it seems we can boil it down to a couple of somewhat distinct proposals:

snip

Should also mention the reasons for adding a new data type.

Chuck
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 10:00 AM, numpy-discussion-requ...@scipy.org wrote:
> I can certainly get one started (but I don't think I can faithfully summarize all this thread's current content, so I apologize in advance for leaving that undone).
>
> DG

OK, I'm lost already: is there general agreement that this should jump straight to one or more NEPs? If not (or if there should be a Wiki page for it additionally), should such become part of the NumPy Wiki @ Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is one's SciPy Wiki login the same as one's mailing list subscriber maintenance login? I guess starting such a page is not as trivial as I had assumed.

DG
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, Jan 21, 2014 at 6:34 PM, David Goldsmith d.l.goldsm...@gmail.com wrote:
> OK, I'm lost already: is there general agreement that this should jump straight to one or more NEPs? If not (or if there should be a Wiki page for it additionally), should such become part of the NumPy Wiki @ Sourceforge or the SciPy Wiki at the scipy.org site? If the latter, is one's SciPy Wiki login the same as one's mailing list subscriber maintenance login? I guess starting such a page is not as trivial as I had assumed.

The wiki is frozen. Please do not add anything to it. It plays no role in our current development workflow. Drafting a NEP or two and iterating on them would be the next step.

--
Robert Kern
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
Hi Chris,

Just stumbled on this discussion (I'm the lead author of h5py). We would be overjoyed if there were a 1-byte text type available in NumPy. String handling is the source of major pain right now in the HDF5 world. All HDF5 strings are text (opaque types are used for binary data), but we're forced into using the S type most of the time because (1) the U type doesn't round-trip between HDF5 and NumPy, as there's no fixed-width wide-character string type in HDF5, and (2) U takes 4x the space, which is a problem for big scientific datasets.

ASCII-only would be preferable, partly for selfish reasons (HDF5's default is ASCII only), and partly to make it possible to copy them into containers labelled UTF-8 without manually inspecting every value.

>> At the high-level interface, h5py exposes three kinds of strings. Each maps to a specific type within Python (but see str_py3 below): Fixed-length ASCII (NumPy S type)
>
> This is wrong, or mis-guided, or maybe only a little confusing -- 'S' is not an ASCII string (even though I wish it were...). But clearly the HDF folks think we need one!

Yes, this was intended to state that the HDF5 fixed-width ASCII type maps to NumPy S at conversion time, which is obviously a wretched solution on Py3.

>>     dset = f.create_dataset("string_ds", (100,), dtype="S10")
>
> Pardon my py3 ignorance -- is numpy.string_ the same as 'S' in py3? From another post, I thought you'd need to use numpy.bytes_ (which is the same on py2).

It does produce an instance of numpy.bytes_, although I think the h5py docs should be changed to use bytes_ explicitly.

Andrew
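(For readers unfamiliar with h5py, a minimal sketch of the status quo Andrew describes -- fixed-width HDF5 strings surface as NumPy bytes on Py3; the file and dataset names are arbitrary:)

    import h5py
    import numpy as np

    with h5py.File('demo.h5', 'w') as f:
        dset = f.create_dataset('names', (2,), dtype='S10')  # fixed-width, ASCII-labelled
        dset[:] = np.array([b'CGA', b'GAT'])
        print(dset[0])  # b'CGA' -- bytes, not text, under Python 3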
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
On Tue, Jan 21, 2014 at 3:22 PM, Andrew Collette andrew.colle...@gmail.com wrote:
> Just stumbled on this discussion (I'm the lead author of h5py). We would be overjoyed if there were a 1-byte text type available in NumPy.

cool -- it looks like someone is going to get a draft NEP going -- so stay tuned, and add your comments when there is something to add them to.

> String handling is the source of major pain right now in the HDF5 world. All HDF5 strings are text (opaque types are used for binary data), but we're forced into using the S type most of the time because (1) the U type doesn't round-trip between HDF5 and NumPy, as there's no fixed-width wide-character string type in HDF5,

it looks from here:

http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html

that HDF uses utf-8 for unicode strings -- so you _could_ round-trip with a lot of calls to encode/decode -- which could be pretty slow, compared to other ways to dump numpy arrays into HDF5 -- that may be what you mean by "doesn't round-trip".

This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary-encoding dtype, anyway). But: how does HDF handle the fact that utf-8 is not a fixed-length encoding?

> ASCII-only would be preferable, partly for selfish reasons (HDF5's default is ASCII only), and partly to make it possible to copy them into containers labelled UTF-8 without manually inspecting every value.

hmm -- ascii does have those advantages, but I'm not sure it's worth the restriction on what can be encoded. But you're quite right, you could dump ascii straight into something expecting utf-8, whereas you could not do that with latin-1, for instance. But you can't go the other way -- does it help much to avoid encoding in one direction? But maybe we can have an any-one-byte-per-char encoding option, in which case h5py could use ascii, but we wouldn't have to everywhere.

-Chris
Re: [Numpy-discussion] A one-byte string dtype?
On Tue, 21 Jan 2014 19:20:12 +0000, Robert Kern robert.k...@gmail.com wrote:
> The wiki is frozen. Please do not add anything to it. It plays no role in our current development workflow. Drafting a NEP or two and iterating on them would be the next step.
>
> --
> Robert Kern

OK, well that's definitely beyond my level of expertise.

DG
Re: [Numpy-discussion] A one-byte string dtype?
On Jan 21, 2014, at 4:58 PM, David Goldsmith d.l.goldsm...@gmail.com wrote:
> OK, well that's definitely beyond my level of expertise.

Well, it's in github -- now's as good a time as any to learn github collaboration...

- Fork the numpy source.
- Create a new file in: numpy/doc/neps
- Point folks to it here so they can comment, etc.

At some point, issue a pull request, and it can get merged into the main source for final polishing...

-Chris
Re: [Numpy-discussion] using loadtxt to load a text file in to a numpy array
Hi Chris,

> it looks from here:
> http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
> that HDF uses utf-8 for unicode strings -- so you _could_ round-trip with a lot of calls to encode/decode -- which could be pretty slow, compared to other ways to dump numpy arrays into HDF5 -- that may be what you mean by "doesn't round-trip".

HDF5 does have variable-length string support for UTF-8, so we map that directly to the unicode type (str on Py3) exactly as you describe, by encoding when we write to the file. But there's no way to round-trip with *fixed-width* strings. You can go from e.g. a 10-byte ASCII string to U10, but going the other way fails if there are characters which take more than 1 byte to represent. We don't always get to choose the destination type, when e.g. writing into an existing dataset, so we can't always write vlen strings.

> This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary-encoding dtype, anyway). But: how does HDF handle the fact that utf-8 is not a fixed-length encoding?

With fixed-width strings it doesn't, really. If you use vlen strings it's fine, but otherwise there's just a fixed-width buffer labelled "UTF-8". Presumably you're supposed to be careful when writing not to chop the string off in the middle of a multibyte character. We could truncate strings on their way to the file, but the risk of data loss/corruption led us to simply not support it at all.

> hmm -- ascii does have those advantages, but I'm not sure it's worth the restriction on what can be encoded. But you're quite right, you could dump ascii straight into something expecting utf-8, whereas you could not do that with latin-1, for instance. But you can't go the other way -- does it help much to avoid encoding in one direction?

It would help for h5py specifically because most HDF5 strings are labelled ASCII. But it's a question for the community which is more important: the high-bit characters in latin-1, or write-compatibility with UTF-8.

Andrew
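(The mid-character truncation hazard Andrew describes is easy to demonstrate in plain Python:)

    # Chopping a utf-8 buffer at a fixed byte width can split a multibyte
    # character, leaving bytes that no longer decode:
    data = u'na\xefve'.encode('utf-8')   # 6 bytes: b'na\xc3\xafve'
    try:
        data[:3].decode('utf-8')         # cuts the 2-byte '\xef' in half
    except UnicodeDecodeError:
        pass  # "can't decode byte 0xc3 ... unexpected end of data"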
[Numpy-discussion] fromiter cannot create array of object - was: Creating an ndarray from an iterable over sequences
Hi,

thanks. Both recarray and itertools.chain work just fine in the example case.

However, the real purpose of this is to read strings from a large xml file into a pandas DataFrame. But fromiter cannot create arrays of dtype 'object'. Fixed-length strings may be worth trying. But as the xml schema does not guarantee a max. length, and pandas generally uses 'object' arrays for strings, I see no better way than creating the array through list comprehensions and turning it into a DataFrame.

Maybe a variable-length string/unicode type would help in the long term.

Leo

>> I would like to write something like:
>>
>>     In [25]: iterable=((i, i**2) for i in range(10))
>>     In [26]: a=np.fromiter(iterable, int32)
>>     ValueError: setting an array element with a sequence.
>>
>> Is there an efficient way to do this?
>
> Perhaps you could just utilize structured arrays (http://docs.scipy.org/doc/numpy/user/basics.rec.html), like:
>
>     iterable = ((i, i**2) for i in range(10))
>     a = np.fromiter(iterable, [('a', int32), ('b', int32)], 10)
>     a.view(int32).reshape(-1, 2)
>
> You could use itertools:
>
>     from itertools import chain
>     g = ((i, i**2) for i in range(10))
>     import numpy
>     numpy.fromiter(chain.from_iterable(g), numpy.int32).reshape(-1, 2)
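(A sketch of the list-based route Leo describes; the generator stands in for the XML parsing, and the column names are made up. np.fromiter with dtype=object raises ValueError here, hence the intermediate list:)

    import pandas as pd

    # Stand-in for strings parsed from the large XML file:
    records = ((str(i), str(i * i)) for i in range(10))

    # fromiter rejects object dtype, so materialise a list and let pandas
    # build the object-dtype columns directly:
    df = pd.DataFrame(list(records), columns=['tag', 'value'])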