Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Sunday 06 December 2009 11:47:23 Francesc Alted escrigué:
> A Saturday 05 December 2009 11:16:55 Dag Sverre Seljebotn escrigué:
> [clip]
> > Well, I for one don't like this, but that's just an opinion. I think
> > it is unwise to leave an object which supports hash() mutable, because
> > it's too easy to create hard-to-find bugs (sticking a dtype as a key
> > in a dict is rather useful in many situations). There's a certain
> > tradition in Python of leaving types immutable if possible, and dtype
> > certainly feels like one.
>
> Yes, I think you are right: forcing dtype to be immutable would be best.

I've filed a ticket so that we don't lose track of this:

http://projects.scipy.org/numpy/ticket/1321

-- Francesc Alted

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
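The hazard being discussed (and that the ticket addresses) can be sketched without NumPy at all: a hashable object whose hash changes after mutation silently becomes unfindable in any dict that holds it as a key. The `FakeDtype` class below is a hypothetical stand-in for illustration, not NumPy code.

```python
class FakeDtype:
    """Stand-in for a dtype whose hash depends on its field names."""

    def __init__(self, names):
        self.names = tuple(names)

    def __hash__(self):
        return hash(self.names)

    def __eq__(self, other):
        return isinstance(other, FakeDtype) and self.names == other.names


t = FakeDtype(("f0", "f1"))
registry = {t: "int32/float32 pair"}   # dtype used as a dict key

t.names = ("one", "other")             # mutate in place: the hash changes

# The entry is still *in* the dict, but can no longer be found by lookup:
print(t in registry)                   # False -- the entry is effectively lost
print(len(registry))                   # 1
```

This is exactly why Python's own hashable built-ins (tuples, frozensets, strings) are immutable.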
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Saturday 05 December 2009 11:16:55 Dag Sverre Seljebotn escrigué:
> [clip]
> Well, I for one don't like this, but that's just an opinion. I think it
> is unwise to leave an object which supports hash() mutable, because it's
> too easy to create hard-to-find bugs (sticking a dtype as a key in a
> dict is rather useful in many situations). There's a certain tradition
> in Python of leaving types immutable if possible, and dtype certainly
> feels like one.

Yes, I think you are right: forcing dtype to be immutable would be best. As a bonus, an immutable dtype would render this ticket:

http://projects.scipy.org/numpy/ticket/1127

without effect.

-- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Francesc Alted wrote:
> A Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn escrigué:
> [clip]
> Mmh, the only case of dtype *mutability* that I'm aware of is changing
> the names of compound types:
>
> In [19]: t = np.dtype("i4,f4")
>
> In [20]: t
> Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])
>
> In [21]: hash(t)
> Out[21]: -9041335829180134223
>
> In [22]: t.names = ('one', 'other')
>
> In [23]: t
> Out[23]: dtype([('one', 'i4'), ('other', 'f4')])
>
> In [24]: hash(t)
> Out[24]: 8637734220020415106
>
> Perhaps this should be marked as a bug? I'm not sure about that,
> because the above seems quite useful.

Well, I for one don't like this, but that's just an opinion. I think it is unwise to leave an object which supports hash() mutable, because it's too easy to create hard-to-find bugs (sticking a dtype as a key in a dict is rather useful in many situations). There's a certain tradition in Python of leaving types immutable if possible, and dtype certainly feels like one.

Anyway, the buffer PEP can be supported simply by updating the buffer format string in the names setter, so it's an orthogonal issue.

BTW, note that the buffer PEP provides for supplying the names of fields: "T{ i:one: f:other: }" (or similar). NumPy should probably do so at some point in the future; the Cython implementation doesn't, because Cython doesn't use this information.

-- Dag Sverre
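For reference, the "T{...}" field-name syntax from PEP 3118 is what the standard library already emits for struct-like buffer exporters: ctypes implements the PEP including field names, observable through memoryview. A stdlib-only sketch (this is ctypes' exporter, not NumPy's):

```python
import ctypes


class Pair(ctypes.Structure):
    # Corresponds to the NumPy dtype [('one', 'i4'), ('other', 'f4')]
    _fields_ = [("one", ctypes.c_int32), ("other", ctypes.c_float)]


mv = memoryview(Pair())

# The PEP 3118 format string embeds the field names between colons,
# e.g. 'T{<i:one:<f:other:}' on a little-endian machine.
print(mv.format)
print(mv.nbytes == ctypes.sizeof(Pair))   # True: 4 + 4 = 8 bytes
```

The exact byte-order prefix ('<' vs. '>') depends on the platform, which is why the expected string above is hedged with "e.g.".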
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Sat, Dec 5, 2009 at 7:16 PM, Dag Sverre Seljebotn da...@student.matnat.uio.no wrote:
> [clip]
> Well, I for one don't like this, but that's just an opinion. I think it
> is unwise to leave an object which supports hash() mutable, because it's
> too easy to create hard-to-find bugs (sticking a dtype as a key in a
> dict is rather useful in many situations). There's a certain tradition
> in Python of leaving types immutable if possible, and dtype certainly
> feels like one.

I agree the behavior is a bit surprising, but I don't know whether any code out there relies on compound dtype names being mutable. Also, the fact that the names attribute is a tuple and not a list suggests that the intent is for it to be immutable.

I am more worried about the variations between Python versions ATM, though; I have no idea where they are coming from.

David
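David's point about the names attribute being a tuple is the usual Python signal of intent: tuples are hashable and support no in-place mutation, while lists are neither. A quick illustration:

```python
names = ("one", "other")
h = hash(names)        # fine: a tuple of hashable items is itself hashable

try:
    hash(["one", "other"])
    list_hashable = True
except TypeError as exc:
    list_hashable = False
    print(exc)         # unhashable type: 'list'

print(list_hashable)   # False
```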
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn escrigué:
> Pauli Virtanen wrote:
> > Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
> > [clip]
> > > Great! Are you storing the format string in the dtype types as well?
> > > (So that no release is needed and acquisitions are cheap...)
> >
> > I regenerate it on each buffer acquisition. It's simple low-level C
> > code, and I suspect it will always be fast enough. Of course, we could
> > *cache* the result in the dtype. (If dtypes are immutable, which I
> > don't remember right now.)
>
> We discussed this at SciPy '09 -- basically, they are not necessarily
> immutable in implementation, but anywhere they are not is a bug, and no
> code should depend on their mutability, so we are free to assume so.

Mmh, the only case of dtype *mutability* that I'm aware of is changing the names of compound types:

In [19]: t = np.dtype("i4,f4")

In [20]: t
Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

In [21]: hash(t)
Out[21]: -9041335829180134223

In [22]: t.names = ('one', 'other')

In [23]: t
Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

In [24]: hash(t)
Out[24]: 8637734220020415106

Perhaps this should be marked as a bug? I'm not sure about that, because the above seems quite useful.

-- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Dec 4, 2009 at 9:23 PM, Francesc Alted fal...@pytables.org wrote:
> [clip]
> Mmh, the only case of dtype *mutability* that I'm aware of is changing
> the names of compound types:
>
> In [21]: hash(t)
> Out[21]: -9041335829180134223
>
> In [22]: t.names = ('one', 'other')
>
> In [24]: hash(t)
> Out[24]: 8637734220020415106
>
> Perhaps this should be marked as a bug? I'm not sure about that,
> because the above seems quite useful.

Hm, that's strange -- I get the same hash in both cases, but I thought I took the names into account when I implemented the hashing protocol for dtype. Which version of NumPy, on which OS, are you seeing this with?

David
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On 12/04/2009 10:12 AM, David Cournapeau wrote:
> [clip]
> Hm, that's strange -- I get the same hash in both cases, but I thought
> I took the names into account when I implemented the hashing protocol
> for dtype. Which version of NumPy, on which OS, are you seeing this
> with?

Hi,

On the same Linux 64-bit Fedora 11, I get the same hash with Python 2.4 and NumPy 1.3, but different hashes with Python 2.6 and NumPy 1.4.

Bruce

Python 2.6 (r26:66714, Jun 8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.4.0.dev7750'
>>> t = np.dtype("i4,f4")
>>> t
dtype([('f0', 'i4'), ('f1', 'f4')])
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> t
dtype([('one', 'i4'), ('other', 'f4')])
>>> hash(t)
8637734220020415106

Python 2.4.5 (#1, Oct 6 2008, 09:54:35)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.3.0.dev6653'
>>> t = np.dtype("i4,f4")
>>> hash(t)
140053539914640
>>> t.names = ('one', 'other')
>>> hash(t)
140053539914640
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Sat, Dec 5, 2009 at 1:31 AM, Bruce Southey bsout...@gmail.com wrote:
> [clip]
> On the same Linux 64-bit Fedora 11, I get the same hash with Python 2.4
> and NumPy 1.3, but different hashes with Python 2.6 and NumPy 1.4.

Could you check the behavior of 1.4.0 on 2.4? The code doing the hashing for dtypes has not changed since 1.3.0, so normally only the Python version should have an influence.

David
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 04 December 2009 17:12:09 David Cournapeau escrigué:
> [clip]
> Hm, that's strange -- I get the same hash in both cases, but I thought
> I took the names into account when I implemented the hashing protocol
> for dtype. Which version of NumPy, on which OS, are you seeing this
> with?

numpy: 1.4.0.dev7072
python: 2.6.1

-- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Sat, Dec 5, 2009 at 1:57 AM, David Cournapeau courn...@gmail.com wrote:
> [clip]
> Could you check the behavior of 1.4.0 on 2.4? The code doing the
> hashing for dtypes has not changed since 1.3.0, so normally only the
> Python version should have an influence.

When I say "should", it should be understood as: this is the only reason why I think it could be different -- the behavior should certainly not depend on the Python version.

David
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On 12/04/2009 10:57 AM, David Cournapeau wrote:
> [clip]
> Could you check the behavior of 1.4.0 on 2.4? The code doing the
> hashing for dtypes has not changed since 1.3.0, so normally only the
> Python version should have an influence.

These are different with Python 2.4 and NumPy 1.4. Curiously, I also got different hash values with Python 2.5 and NumPy 1.3. (For what it is worth, I get the same hash values with Python 2.3 and NumPy 1.1.1.)

Bruce

Python 2.5.2 (r252:60911, Nov 18 2008, 09:20:42)
[GCC 4.3.2 20081105 (Red Hat 4.3.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.3.0'
>>> t = np.dtype("i4,f4")
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> hash(t)
8637734220020415106

[bsout...@starling python]$ /usr/local/bin/python2.4
Python 2.4.5 (#1, Oct 6 2008, 09:54:35)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.4.0rc1'
>>> t = np.dtype("i4,f4")
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> hash(t)
8637734220020415106

Python 2.3.7 (#1, Oct 6 2008, 09:55:54)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.1.1'
>>> t = np.dtype("i4,f4")
>>> hash(t)
140552637936672
>>> t.names = ('one', 'other')
>>> hash(t)
140552637936672
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Dag Sverre Seljebotn wrote:
> Dag Sverre Seljebotn wrote:
> > Pauli Virtanen wrote:
> > > Fri, 27 Nov 2009 23:19:58 +0100, Dag Sverre Seljebotn wrote:
> > > [clip]
> > > > One thing to keep in mind here is that PEP 3118 actually defines a
> > > > standard dtype format string, which is (mostly) incompatible with
> > > > NumPy's. It should probably be supported as well when PEP 3118 is
> > > > implemented.
> > >
> > > PEP 3118 is for the most part implemented in my Py3K branch now --
> > > it was not actually much work, as I could steal most of the format
> > > string converter from numpy.pxd.
> >
> > Great! Are you storing the format string in the dtype types as well?
> > (So that no release is needed and acquisitions are cheap...)
> >
> > As far as numpy.pxd goes -- well, for the simplest dtypes. Some
> > questions: How hard do we want to try supplying a buffer? E.g. if the
> > consumer does not specify strided but specifies suboffsets, should we
> > try to compute suitable suboffsets? Should we try making contiguous
> > copies of the data (I guess this would break buffer semantics?)?
>
> Actually, per the PEP, suboffsets imply strided:
>
> #define PyBUF_INDIRECT (0x0100 | PyBUF_STRIDES)
>
> :-) So there's no real way for a consumer to specify only suboffsets;
> 0x0100 is not a possible flag, I think. Suboffsets can't really work
> without the strides anyway, IIUC, and in the case of NumPy the field
> can always be left at 0.

That is, NULL!

> IMO one should very much stay clear of making contiguous copies,
> especially considering the existence of PyBuffer_ToContiguous, which
> makes it trivial for client code to get a pointer to a contiguous
> buffer anyway. The intention of the PEP seems to be to export the
> buffer in as raw a form as possible.
>
> Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
> conservative with NumPy arrays. If a contiguous buffer is requested,
> then looping through the strides and checking that the strides are
> monotonically decreasing/increasing could eventually save copying in
> some cases.

And, of course, that the innermost stride is 1.

Aargh. Some day I'll find/implement a ten-minute send delay for my email program, so I'll catch my errors before the emails go out... Anyway, this is not sufficient; one must also check correspondence with the shape, of course.

> I think that could be worth it -- I actually have my own code for
> IS_F_CONTIGUOUS rather than relying on the flags, personally, because
> of this issue, so it does come up in practice.

Dag Sverre
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
[clip]
> Great! Are you storing the format string in the dtype types as well?
> (So that no release is needed and acquisitions are cheap...)

I regenerate it on each buffer acquisition. It's simple low-level C code, and I suspect it will always be fast enough. Of course, we could *cache* the result in the dtype. (If dtypes are immutable, which I don't remember right now.)

Do you have a case in mind where the speed of format string generation would be a bottleneck?

> Some questions: How hard do we want to try supplying a buffer? E.g. if
> the consumer does not specify strided but specifies suboffsets, should
> we try to compute suitable suboffsets? Should we try making contiguous
> copies of the data (I guess this would break buffer semantics?)?
>
> Actually, per the PEP, suboffsets imply strided:
>
> #define PyBUF_INDIRECT (0x0100 | PyBUF_STRIDES)
>
> :-) So there's no real way for a consumer to specify only suboffsets;
> 0x0100 is not a possible flag, I think. Suboffsets can't really work
> without the strides anyway, IIUC, and in the case of NumPy the field
> can always be left at 0.

Ok, great!

> IMO one should very much stay clear of making contiguous copies,
> especially considering the existence of PyBuffer_ToContiguous, which
> makes it trivial for client code to get a pointer to a contiguous
> buffer anyway. The intention of the PEP seems to be to export the
> buffer in as raw a form as possible.

This is what I thought, too.

> Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
> conservative with NumPy arrays. If a contiguous buffer is requested,
> then looping through the strides and checking that the strides are
> monotonically decreasing/increasing could eventually save copying in
> some cases. I think that could be worth it -- I actually have my own
> code for IS_F_CONTIGUOUS rather than relying on the flags, personally,
> because of this issue, so it does come up in practice.

Are you sure? Assume monotonically increasing or decreasing strides with an inner stride of itemsize. Now, if the strides are not C- or F-contiguous, doesn't this imply that part of the data in the memory block is *not* pointed to by any set of indices? [For example, strides = {itemsize, 3*itemsize}; dims = {2, 2}. Now, there is unused memory between items (1,0) and (0,1).]

This probably boils down to what exactly was meant in the PEP and the Python docs by "contiguous". I'd believe it was meant to be the same as in NumPy -- that you can send the array data e.g. to Fortran as-is. If so, there should not be gaps in the data if the client explicitly requested that the buffer be contiguous.

Maybe you meant that the NumPy array flags (which the macros check) are not always up-to-date wrt. the stride information?

-- Pauli Virtanen
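The contiguity semantics under discussion are observable in the standard library itself: a multidimensional memoryview exposes the strides and the C/Fortran contiguity flags defined by PEP 3118. A small sketch:

```python
buf = bytes(range(12))
mv = memoryview(buf).cast('B', (3, 4))   # 3x4 row-major view, itemsize 1

print(mv.strides)        # (4, 1): inner stride == itemsize, rows 4 bytes apart
print(mv.c_contiguous)   # True: no gaps, C element order
print(mv.f_contiguous)   # False: Fortran order would need strides (1, 3)
```

Pauli's gap example corresponds to strides like (1, 3) with shape (2, 2) over a 6-byte block: neither flag would be set, because bytes between the addressed items are not reachable by any index.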
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Pauli Virtanen wrote:
> Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
> [clip]
> > Great! Are you storing the format string in the dtype types as well?
> > (So that no release is needed and acquisitions are cheap...)
>
> I regenerate it on each buffer acquisition. It's simple low-level C
> code, and I suspect it will always be fast enough. Of course, we could
> *cache* the result in the dtype. (If dtypes are immutable, which I
> don't remember right now.)

We discussed this at SciPy '09 -- basically, they are not necessarily immutable in implementation, but anywhere they are not is a bug, and no code should depend on their mutability, so we are free to assume so.

> Do you have a case in mind where the speed of format string generation
> would be a bottleneck?

Going all the way down to user code: no. Well, here's a contrived one: you have a Python list of NumPy arrays and want to sum over the first element of each, acquiring the buffers through PEP 3118 (which is easy through Cython). In that case I can see all the memory allocation that must go on per element for the format string becoming a bottleneck.

But mostly it's about cleanliness of implementation, like the fact that you don't know up front how long the string needs to be for nested dtypes. Obviously, what you have done is much better than nothing, and probably sufficient for nearly all purposes, so I should stop complaining.

> > Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
> > conservative with NumPy arrays. [clip] I think that could be worth it
> > -- I actually have my own code for IS_F_CONTIGUOUS rather than
> > relying on the flags, personally, because of this issue, so it does
> > come up in practice.
>
> Are you sure? Assume monotonically increasing or decreasing strides
> with an inner stride of itemsize. Now, if the strides are not C- or
> F-contiguous, doesn't this imply that part of the data in the memory
> block is *not* pointed to by any set of indices? [For example, strides
> = {itemsize, 3*itemsize}; dims = {2, 2}. Now, there is unused memory
> between items (1,0) and (0,1).]
>
> [clip]
>
> Maybe you meant that the NumPy array flags (which the macros check)
> are not always up-to-date wrt. the stride information?

Yep, this is what I meant, and the rest is wrong. But now that I think about it, the case that bit me is

In [14]: np.arange(10)[None, None, :].flags.c_contiguous
Out[14]: False

I suppose this particular case could be fixed properly with little cost (if it isn't already). It is probably cleaner to just rely on the flags for PEP 3118, less confusion etc. Sorry for the distraction.

Dag Sverre
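The caching that keeps coming up only works because hashable dtypes would make valid cache keys; with an immutable key it reduces to ordinary memoization. A hypothetical sketch using functools.lru_cache -- the `build_format` body is made up for illustration and is not NumPy's or Cython's actual PEP 3118 converter:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_format(fields):
    # fields: tuple of (name, typecode) pairs -- must be hashable to be
    # usable as a cache key, which is exactly why mutability matters here.
    # Hypothetical stand-in for the real format-string builder.
    return "T{" + "".join(f"{code}:{name}:" for name, code in fields) + "}"


fmt = build_format((("one", "<i"), ("other", "<f")))
print(fmt)  # T{<i:one:<f:other:}

# A repeat call returns the cached string: no per-acquisition allocation.
assert build_format((("one", "<i"), ("other", "<f"))) is fmt
```

If the key could be mutated after caching (as with renamable dtype fields), the cache would silently serve stale format strings.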
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
to, 2009-11-26 kello 17:37 -0700, Charles R Harris kirjoitti:
[clip]
> I'm not clear on your recommendation here, is it that we should use
> bytes, with unicode converted to UTF-8?

The point is that I don't think we can just decide to use Unicode or Bytes in all places where PyString was used earlier. Which one it will be should depend on the use. Users will expect that e.g.

    array([1,2,3], dtype='f4')

still works, and that they don't have to write

    array([1,2,3], dtype=b'f4')

To summarize the use cases I've run across so far:

1) For the 'S' dtype, I believe we use Bytes for the raw data and the interface. Maybe we want to introduce a separate bytes dtype that's an alias for 'S'?

2) The field names:

    a = array([], dtype=[('a', int)])
    a = array([], dtype=[(b'a', int)])

This is somewhat of an internal issue. We need to decide whether we internally coerce input to Unicode or Bytes. Or whether we allow both Unicode and Bytes (but preserving the previous semantics in this case requires extra work, due to semantic changes in PyDict).

Currently, there's some code in NumPy to allow for Unicode field names, but it's not been coherently implemented in all places, so e.g. direct creation of dtypes with Unicode field names fails. This also has implications for field titles, as those are stored in the fields dict as well.

3) Format strings:

    a = array([], dtype=b'i4')

I don't think it makes sense to handle format strings in Unicode internally -- they should always be coerced to bytes. This will make things easier at many points, since it will be enough to do PyBytes_AS_STRING(str) to get the char* pointer, rather than having to encode to UTF-8 first. The same goes for all other similar uses of strings, e.g. protocol descriptors. User input should just be coerced to ASCII on input, I believe.

The problem here is that preserving repr() in this case requires some extra work. But maybe that has to be done.

> Will that support arrays that have been pickled and such?

Are the pickles backward-compatible between Python 2 and 3 at all? I think using Bytes for format strings will be backward-compatible. Field names are a bit more difficult. Actually, we'll probably just have to coerce them to either Bytes or Unicode internally, since we'll need to do that on unpickling anyway if we want to be backward-compatible.

> Or will we just have a minimum of code to fix up?

I think we will in any case need to replace all use of PyString in NumPy by PyBytes or PyUnicode, depending on the context, and

    #define PyString PyBytes

for Python 2. This seems to be the easiest way to make sure we have fixed all the points that need fixing. Currently, 193 of 800 numpy.core tests don't pass, and this seems largely due to Bytes vs. Unicode issues.

> And could you expand on the changes that repr() might undergo?

The main thing is that

    dtype('i4')
    dtype([('a', 'i4')])

may become

    dtype(b'i4')
    dtype([(b'a', b'i4')])

Of course, we can write and #ifdef separate repr formatting code for Py3, but this is a bit of extra work.

> Mind, I think using bytes sounds best, but I haven't looked into the
> whole strings part of the transition and don't have an informed opinion
> on the matter.

***

By the way, should I commit this stuff (after factoring the commits into logical chunks) to SVN? It does not break anything for Python 2, at least as far as the test suite is concerned.

Pauli
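The coercion Pauli describes for case 3 -- accept both str and bytes from the user, normalize to ASCII bytes internally -- can be sketched in a few lines. `asbytes` here is a hypothetical helper for illustration, not NumPy's actual implementation:

```python
def asbytes(fmt):
    """Normalize a user-supplied format string to ASCII bytes."""
    if isinstance(fmt, bytes):
        return fmt                    # already bytes: pass through
    if isinstance(fmt, str):
        # Raises UnicodeEncodeError for non-ASCII input, which is the
        # desired behavior: format codes are plain ASCII by definition.
        return fmt.encode('ascii')
    raise TypeError(f"format string expected, got {type(fmt).__name__}")


print(asbytes('i4'))    # b'i4'
print(asbytes(b'i4'))   # b'i4'
```

With such a funnel at the API boundary, the C internals can assume bytes everywhere and use PyBytes_AS_STRING directly, as the message suggests.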
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Pauli Virtanen wrote: By the way, should I commit this stuff (after factoring the commits to logical chunks) to SVN? I would prefer getting at least one py3 buildbot before doing anything significant, cheers, David
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
pe, 2009-11-27 kello 18:30 +0900, David Cournapeau kirjoitti: Pauli Virtanen wrote: By the way, should I commit this stuff (after factoring the commits to logical chunks) to SVN? I would prefer getting at least one py3 buildbot before doing anything significant, I can add it to mine: http://buildbot.scipy.org/builders/Linux_x86_Ubuntu/builds/279/steps/shell_1/logs/stdio It already does 2.4, 2.5 and 2.6. Pauli
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 27 November 2009 10:47:53 Pauli Virtanen escrigué: 1) For 'S' dtype, I believe we use Bytes for the raw data and the interface. Maybe we want to introduce a separate bytes dtype that's an alias for 'S'? Yeah. As regular strings in Python 3 are Unicode, I think that introducing a separate bytes dtype would help doing the transition. Meanwhile, the following should still work: In [2]: s = np.array(['asa'], dtype='S10') In [3]: s[0] Out[3]: 'asa' # will become b'asa' in Python 3 In [4]: s.dtype.itemsize Out[4]: 10 # still 1 byte per element Also, I suppose that there will be issues with the current Unicode support in NumPy: In [5]: u = np.array(['asa'], dtype='U10') In [6]: u[0] Out[6]: u'asa' # will become 'asa' in Python 3 In [7]: u.dtype.itemsize Out[7]: 40 # not sure about the size in Python 3 For example, if it is true that internal strings in Python 3 are Unicode UTF-8 (as René seems to suggest), I suppose that the internal conversions from 2-byte or 4-byte (depending on how the Python interpreter has been compiled) in NumPy's Unicode dtype to the new Python string would have to be reworked (perhaps you have dealt with that already). Cheers, -- Francesc Alted
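For reference, under Python 3 with a modern NumPy (shown only as illustration of how this eventually landed, not as part of the original discussion), the behaviour Francesc predicts is exactly what happens:

```python
import numpy as np

# 'S10' stores raw bytes, one byte per element, padded to 10 bytes.
s = np.array(['asa'], dtype='S10')

assert s[0] == b'asa'          # indexing yields a bytes-like scalar
assert s.dtype.itemsize == 10  # still 1 byte per element
```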
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
pe, 2009-11-27 kello 11:17 +0100, Francesc Alted kirjoitti: A Friday 27 November 2009 10:47:53 Pauli Virtanen escrigué: 1) For 'S' dtype, I believe we use Bytes for the raw data and the interface. Maybe we want to introduce a separate bytes dtype that's an alias for 'S'? Yeah. As regular strings in Python 3 are Unicode, I think that introducing a separate bytes dtype would help doing the transition. Meanwhile, the following should still work: In [2]: s = np.array(['asa'], dtype='S10') In [3]: s[0] Out[3]: 'asa' # will become b'asa' in Python 3 In [4]: s.dtype.itemsize Out[4]: 10 # still 1 byte per element Yes. But now I wonder, should array(['foo'], str) and array(['foo']) be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U', which will mean unavoidable code breakage -- there's probably no avoiding it. [clip] Also, I suppose that there will be issues with the current Unicode support in NumPy: In [5]: u = np.array(['asa'], dtype='U10') In [6]: u[0] Out[6]: u'asa' # will become 'asa' in Python 3 In [7]: u.dtype.itemsize Out[7]: 40 # not sure about the size in Python 3 I suspect the Unicode stuff will keep working without major changes, except maybe dropping the u in repr. It is difficult to believe the CPython guys would have significantly changed the current Unicode implementation, if they didn't bother changing the names of the functions :) For example, if it is true that internal strings in Python 3 are Unicode UTF-8 (as René seems to suggest), I suppose that the internal conversions from 2-byte or 4-byte (depending on how the Python interpreter has been compiled) in NumPy's Unicode dtype to the new Python string would have to be reworked (perhaps you have dealt with that already). I don't think they are internally UTF-8: http://docs.python.org/3.1/c-api/unicode.html Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. 
-- Pauli Virtanen
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 27 November 2009 11:27:00 Pauli Virtanen escrigué: Yes. But now I wonder, should array(['foo'], str) and array(['foo']) be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U', which will mean unavoidable code breakage -- there's probably no avoiding it. Mmh, you are right. Yes, this seems to be difficult to solve. Well, I'm changing my mind and think that both 'str' and 'S' should stand for Unicode in NumPy for Python 3. If people are aware of the change for Python 3, they should be expecting the same change happening in NumPy too, I guess. Then, I suppose that a new bytes dtype that replaces the existing string one would be absolutely necessary. Also, I suppose that there will be issues with the current Unicode support in NumPy: In [5]: u = np.array(['asa'], dtype='U10') In [6]: u[0] Out[6]: u'asa' # will become 'asa' in Python 3 In [7]: u.dtype.itemsize Out[7]: 40 # not sure about the size in Python 3 I suspect the Unicode stuff will keep working without major changes, except maybe dropping the u in repr. It is difficult to believe the CPython guys would have significantly changed the current Unicode implementation, if they didn't bother changing the names of the functions :) For example, if it is true that internal strings in Python 3 are Unicode UTF-8 (as René seems to suggest), I suppose that the internal conversions from 2-byte or 4-byte (depending on how the Python interpreter has been compiled) in NumPy's Unicode dtype to the new Python string would have to be reworked (perhaps you have dealt with that already). I don't think they are internally UTF-8: http://docs.python.org/3.1/c-api/unicode.html Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. Ah! No changes for that matter. Much better then. -- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Nov 27, 2009 at 11:50 AM, Francesc Alted fal...@pytables.org wrote: A Friday 27 November 2009 11:27:00 Pauli Virtanen escrigué: Yes. But now I wonder, should array(['foo'], str) and array(['foo']) be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U', which will mean unavoidable code breakage -- there's probably no avoiding it. Mmh, you are right. Yes, this seems to be difficult to solve. Well, I'm changing my mind and think that both 'str' and 'S' should stand for Unicode in NumPy for Python 3. If people are aware of the change for Python 3, they should be expecting the same change happening in NumPy too, I guess. Then, I suppose that a new bytes dtype that replaces the existing string one would be absolutely necessary. Also, I suppose that there will be issues with the current Unicode support in NumPy: In [5]: u = np.array(['asa'], dtype='U10') In [6]: u[0] Out[6]: u'asa' # will become 'asa' in Python 3 In [7]: u.dtype.itemsize Out[7]: 40 # not sure about the size in Python 3 I suspect the Unicode stuff will keep working without major changes, except maybe dropping the u in repr. It is difficult to believe the CPython guys would have significantly changed the current Unicode implementation, if they didn't bother changing the names of the functions :) For example, if it is true that internal strings in Python 3 are Unicode UTF-8 (as René seems to suggest), I suppose that the internal conversions from 2-byte or 4-byte (depending on how the Python interpreter has been compiled) in NumPy's Unicode dtype to the new Python string would have to be reworked (perhaps you have dealt with that already). I don't think they are internally UTF-8: http://docs.python.org/3.1/c-api/unicode.html Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. Ah! No changes for that matter. Much better then. Hello, in py3... 'Hello\u0020World !'.encode() b'Hello World !' 
'Äpfel'.encode('utf-8') b'\xc3\x84pfel' 'Äpfel'.encode() b'\xc3\x84pfel' The default encoding does appear to be utf-8 in py3, although it is compiled with something different, and stores it as something different, that is UCS2 or UCS4. I imagine dtype 'S' and 'U' need more clarification, as they miss the concept of encodings, it seems? Currently, 'S' appears to mean 8-bit characters with no encoding, and 'U' appears to mean 16-bit characters with no encoding? Or are some sort of default encodings assumed? 2to3/3to2 fixers will probably have to be written for users' code here... whatever is decided. At least warnings should be generated, I'm guessing. btw, in my numpy tree there is a unicode_() alias to str in py3, and to unicode in py2 (inside the compat.py file). This helped us in many cases with compatible string code in the pygame port. This allows you to create unicode strings on both platforms with the same code. cheers,
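René's observation about the default encoding can be checked directly; on Python 3, str.encode() with no argument produces UTF-8, regardless of the interpreter's internal UCS2/UCS4 storage:

```python
import sys

# The default text encoding in Python 3 is UTF-8...
assert sys.getdefaultencoding() == 'utf-8'

# ...so encode() with no argument produces UTF-8 bytes:
assert 'Hello\u0020World !'.encode() == b'Hello World !'
assert 'Äpfel'.encode() == b'\xc3\x84pfel'   # U+00C4 -> 0xC3 0x84
```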
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
pe, 2009-11-27 kello 13:23 +0100, René Dudfield kirjoitti: [clip] I imagine dtype 'S' and 'U' need more clarification, as they miss the concept of encodings, it seems? Currently, 'S' appears to mean 8-bit characters with no encoding, and 'U' appears to mean 16-bit characters with no encoding? Or are some sort of default encodings assumed? Currently in Numpy in Python 2, 'S' is the same as Python 3 bytes, 'U' is the same as Python 3 unicode and probably in the same internal representation (need to check). Neither is associated with encoding info. We probably need to change the meaning of 'S', as Francesc noted, and add a separate bytes dtype. 2to3/3to2 fixers will probably have to be written for users' code here... whatever is decided. At least warnings should be generated, I'm guessing. Possibly. Does 2to3 support plugins? If yes, it could be possible to write one. btw, in my numpy tree there is a unicode_() alias to str in py3, and to unicode in py2 (inside the compat.py file). This helped us in many cases with compatible string code in the pygame port. This allows you to create unicode strings on both platforms with the same code. Yes, I saw that. The name unicode_ is however already taken by the Numpy scalar type, so we need to think of a different name for it. asstring, maybe. Btw, do you want to rebase your distutils changes on top of my tree? I tried yours out quickly, but there were some issues there that prevented distutils from working. (Also, you can use absolute imports both for Python 2 and 3 -- there's probably no need to use relative imports.) Pauli
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 27 November 2009 13:23:10 René Dudfield escrigué: I don't think they are internally UTF-8: http://docs.python.org/3.1/c-api/unicode.html Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. Ah! No changes for that matter. Much better then. Hello, in py3... 'Hello\u0020World !'.encode() b'Hello World !' 'Äpfel'.encode('utf-8') b'\xc3\x84pfel' 'Äpfel'.encode() b'\xc3\x84pfel' The default encoding does appear to be utf-8 in py3, although it is compiled with something different, and stores it as something different, that is UCS2 or UCS4. OK. Which encoding is the default for Unicode is one thing; how Python keeps Unicode internally is another. And internally Python 3 is still using UCS2 or UCS4, i.e. the same thing as in Python 2, so no worries here. I imagine dtype 'S' and 'U' need more clarification, as they miss the concept of encodings, it seems? Currently, 'S' appears to mean 8-bit characters with no encoding, and 'U' appears to mean 16-bit characters with no encoding? Or are some sort of default encodings assumed? [clip] You only need an encoding if you are going to represent Unicode strings with other types (for example bytes). Currently, NumPy can transparently import/export native Python Unicode strings (UCS2 or UCS4) into its own Unicode type (always UCS4). So, we don't have to worry here either. btw, in my numpy tree there is a unicode_() alias to str in py3, and to unicode in py2 (inside the compat.py file). This helped us in many cases with compatible string code in the pygame port. This allows you to create unicode strings on both platforms with the same code. Correct. But, in addition, we are going to need a new 'bytes' dtype for NumPy for Python 3, right? -- Francesc Alted
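The "always UCS4" point shows up directly in the itemsize: with a modern NumPy (used here purely as illustration), the 'U' dtype stores 4 bytes per character whatever the interpreter's internal representation:

```python
import numpy as np

u = np.array(['asa'], dtype='U10')

assert u[0] == 'asa'           # round-trips as a native str
assert u.dtype.itemsize == 40  # 10 characters x 4 bytes (UCS4)
```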
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Nov 27, 2009 at 1:41 PM, Pauli Virtanen p...@iki.fi wrote: 2to3/3to2 fixers will probably have to be written for users code here... whatever is decided. At least warnings should be generated I'm guessing. Possibly. Does 2to3 support plugins? If yes, it could be possible to write one. You can put them in here: [lib_dir]lib2to3/fixes/fix_*.py I'm not sure about how to use custom ones without just copying them in... need to research that. There's no documentation about how to write custom ones here: http://docs.python.org/library/2to3.html You can pass lib2to3 a package to try import fixers from. However I'm not sure how to make that appear from the command line, other than copying the fixer into place. I guess the numpy setup script could copy the fixer into place. btw, in my numpy tree there is a unicode_() alias to str in py3, and to unicode in py2 (inside the compat.py file). This helped us in many cases with compatible string code in the pygame port. This allows you to create unicode strings on both platforms with the same code. Yes, I saw that. The name unicode_ is however already taken by the Numpy scalar type, so we need to think of a different name for it. asstring, maybe. something like numpy.compat.unicode_ ? Btw, do you want to rebase your distutils changes on top of my tree? I tried yours out quickly, but there were some issues there that prevented distutils from working. (Also, you can use absolute imports both for Python 2 and 3 -- there's probably no need to use relative imports.) Pauli hey, yeah I definitely would :) I don't have much time for the next week or so though. cu,
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote: Correct. But, in addition, we are going to need a new 'bytes' dtype for NumPy for Python 3, right? I think so. However, I think S is probably closest to bytes... and maybe S can be reused for bytes... I'm not sure though. Also, what will a bytes dtype mean within a py2 program context? Does it matter if the bytes dtype just fails somehow if used in a py2 program? cheers,
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Nov 27, 2009 at 3:07 PM, René Dudfield ren...@gmail.com wrote: hey, yeah I definitely would :) I don't have much time for the next week or so though. btw, feel free to just copy whatever you like from there into your tree. cheers,
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 27 November 2009 15:09:00 René Dudfield escrigué: On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote: Correct. But, in addition, we are going to need a new 'bytes' dtype for NumPy for Python 3, right? I think so. However, I think S is probably closest to bytes... and maybe S can be reused for bytes... I'm not sure though. That could be a good idea because that would ensure compatibility with existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it should be). The only thing that I don't like is that 'S' seems to be the initial letter for 'string', which is actually 'unicode' in Python 3 :-/ But, for the sake of compatibility, we can probably live with that. Also, what will a bytes dtype mean within a py2 program context? Does it matter if the bytes dtype just fails somehow if used in a py2 program? Mmh, I'm of the opinion that the new 'bytes' type should be available only with NumPy for Python 3. Would that be possible? -- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
pe, 2009-11-27 kello 16:33 +0100, Francesc Alted kirjoitti: A Friday 27 November 2009 15:09:00 René Dudfield escrigué: On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote: Correct. But, in addition, we are going to need a new 'bytes' dtype for NumPy for Python 3, right? I think so. However, I think S is probably closest to bytes... and maybe S can be reused for bytes... I'm not sure though. That could be a good idea because that would ensure compatibility with existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it should be). The only thing that I don't like is that 'S' seems to be the initial letter for 'string', which is actually 'unicode' in Python 3 :-/ But, for the sake of compatibility, we can probably live with that. Well, we can deprecate 'S' (i.e. never show it in repr, always only 'B' or 'U'). Also, what will a bytes dtype mean within a py2 program context? Does it matter if the bytes dtype just fails somehow if used in a py2 program? Mmh, I'm of the opinion that the new 'bytes' type should be available only with NumPy for Python 3. Would that be possible? I don't see a problem in making a bytes_ scalar type available for Python 2. In fact, it would be useful for making upgrading to Py3 easier. -- Pauli Virtanen
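For what it's worth, this is roughly how things eventually landed in modern NumPy (shown as illustration, not part of the original discussion): a bytes_ scalar type exists and is a subclass of the builtin bytes:

```python
import numpy as np

# np.bytes_ is NumPy's bytes scalar type, a subclass of builtin bytes.
assert issubclass(np.bytes_, bytes)

b = np.bytes_(b'raw')
assert isinstance(b, bytes) and b == b'raw'
```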
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
A Friday 27 November 2009 16:41:04 Pauli Virtanen escrigué: I think so. However, I think S is probably closest to bytes... and maybe S can be reused for bytes... I'm not sure though. That could be a good idea because that would ensure compatibility with existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it should be). The only thing that I don't like is that 'S' seems to be the initial letter for 'string', which is actually 'unicode' in Python 3 :-/ But, for the sake of compatibility, we can probably live with that. Well, we can deprecate 'S' (i.e. never show it in repr, always only 'B' or 'U'). Well, deprecating 'S' seems a sensible option too. But why only avoid showing it in repr? Why not issue a DeprecationWarning too? Also, what will a bytes dtype mean within a py2 program context? Does it matter if the bytes dtype just fails somehow if used in a py2 program? Mmh, I'm of the opinion that the new 'bytes' type should be available only with NumPy for Python 3. Would that be possible? I don't see a problem in making a bytes_ scalar type available for Python 2. In fact, it would be useful for making upgrading to Py3 easier. I think introducing a bytes_ scalar dtype could be somewhat confusing for Python 2 users. But if the 'S' typecode is to be deprecated also for NumPy for Python 2, then it makes perfect sense to introduce bytes_ there too. -- Francesc Alted
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
The point is that I don't think we can just decide to use Unicode or Bytes in all places where PyString was used earlier. Agreed. I think it's helpful to remember the origins of all this: IMHO, there are two distinct types of data that Python2 strings support: 1) text: this is the traditional string. 2) bytes: raw bytes -- they could represent anything. This, of course, is what the py3k string and bytes types are all about. However, when Python started, it just so happened that text was represented by an array of unsigned single-byte integers, so there really was no point in having a bytes type, as a string would work just as well. Enter unicode: Now we have multiple ways of representing text internally, but want a single interface to that -- one that looks and acts like a sequence of characters to user's code. The result is that the unicode type was introduced. In a way, unicode strings are a bit like arrays: they have an encoding associated with them (like a dtype in numpy). You can represent a given bit of text in multiple different arrangements of bytes, but they are all supposed to mean the same thing and, if you know the encoding, you can convert between them. This is kind of like how one can represent 5 in any of many dtypes: uint8, int16, int32, float32, float64, etc. Not every value represented by one dtype can be converted to all other dtypes, but many can. Just like encodings. Anyway, all this brings me to think about the use of strings in numpy in this way: if it is meant to be a human-readable piece of text, it should be a unicode object. If not, then it is bytes. So: fromstring and the like should, of course, work with bytes (though maybe buffers really...) Which one it will be should depend on the use. Users will expect that e.g. array([1,2,3], dtype='f4') still works, and they don't have to do e.g. array([1,2,3], dtype=b'f4'). Personally, I try to use np.float32 instead, anyway, but I digress. 
In this case, the type code is supposed to be a human-readable bit of text -- it should be a unicode object (convertible to ascii for interfacing with C...) If we used b'f4', it would confuse things, as it couldn't be printed right. Also: would the actual bytes involved potentially change depending on what encoding was used for the literal? i.e. if the code was written in utf16, would that byte string be 4 bytes long? To summarize the use cases I've run across so far: 1) For 'S' dtype, I believe we use Bytes for the raw data and the interface. I don't think so here. 'S' is usually used to store human-readable strings, I'd certainly expect to be able to do: s_array = np.array(['this', 'that'], dtype='S10') And I'd expect it to work with non-literals that were unicode strings, i.e. human-readable text. In fact, it's pretty rare that I'd ever want bytes here. So I'd see 'S' mapped to 'U' here. Francesc Alted wrote: the following should still work: In [2]: s = np.array(['asa'], dtype='S10') In [3]: s[0] Out[3]: 'asa' # will become b'asa' in Python 3 I don't like that -- I put in a string, and get a bytes object back? In [4]: s.dtype.itemsize Out[4]: 10 # still 1 byte per element But what if the strings passed in aren't representable in one byte per character? Do we define 'S' as supporting ANSI-only strings? What encoding? Pauli Virtanen wrote: 'U' is same as Python 3 unicode and probably in same internal representation (need to check). Neither is associated with encoding info. Isn't it? I thought the encoding was always the same internally? so it is known? Francesc Alted wrote: That could be a good idea because that would ensure compatibility with existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it should be). What do you mean by compatible? It would mean a lot of user code would have to change with the 2-3 transition. 
The only thing that I don't like is that 'S' seems to be the initial letter for 'string', which is actually 'unicode' in Python 3 :-/ But, for the sake of compatibility, we can probably live with that. I suppose we could at least deprecate it. Also, what will a bytes dtype mean within a py2 program context? Does it matter if the bytes dtype just fails somehow if used in a py2 program? well, it should work in 2.6 anyway. Maybe we want to introduce a separate bytes dtype that's an alias for 'S'? What do we need bytes for? does it support anything that np.uint8 doesn't? 2) The field names: a = array([], dtype=[('a', int)]) a = array([], dtype=[(b'a', int)]) This is somewhat of an internal issue. We need to decide whether we internally coerce input to Unicode or Bytes. Unicode is clear to me here -- it really should match what Python does for variable names -- that is unicode in py3k, no? 3) Format strings a = array([], dtype=b'i4') I don't think it makes sense to handle format strings in Unicode internally -- they
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
pe, 2009-11-27 kello 10:36 -0800, Christopher Barker kirjoitti: [clip] Which one it will be should depend on the use. Users will expect that e.g. array([1,2,3], dtype='f4') still works, and they don't have to do e.g. array([1,2,3], dtype=b'f4'). Personally, I try to use np.float32 instead, anyway, but I digress. In this case, the type code is supposed to be a human-readable bit of text -- it should be a unicode object (convertible to ascii for interfacing with C...) Yes, this would solve the repr() issue easily. Now that I look more closely, the format strings are not actually used anywhere else than in the descriptor user interface, so from an implementation POV Unicode is not any harder. [clip] Pauli Virtanen wrote: 'U' is same as Python 3 unicode and probably in same internal representation (need to check). Neither is associated with encoding info. Isn't it? I thought the encoding was always the same internally? so it is known? Yes, so it needs not be associated with a separate piece of encoding info. [clip] Maybe we want to introduce a separate bytes dtype that's an alias for 'S'? What do we need bytes for? does it support anything that np.uint8 doesn't? It has a string representation, but that's probably it. Actually, in Python 3, when you index a bytes object, you get integers back, so just aliasing bytes_ = uint8 and making sure array() handles bytes objects appropriately would be more or less consistent. 2) The field names: a = array([], dtype=[('a', int)]) a = array([], dtype=[(b'a', int)]) This is somewhat of an internal issue. We need to decide whether we internally coerce input to Unicode or Bytes. Unicode is clear to me here -- it really should match what Python does for variable names -- that is unicode in py3k, no? Yep, let's follow Python. So Unicode and only Unicode it is. *** Ok, thanks for the feedback. The right answers seem to be: 1) Unicode works as it is now, and Python3 strings are Unicode. Bytes objects are coerced to uint8 by array(). 
We don't do implicit conversions between Bytes and Unicode. The 'S' dtype character will be deprecated, never appear in repr(), and its usage will result in a warning. 2) Field names are unicode always. Some backward compatibility needs to be added in pickling, and maybe the npy file format needs a fixed encoding. 3) Dtype strings are a user interface detail, and will be Unicode. -- Pauli Virtanen
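The "bytes objects become uint8" idea can be sketched with frombuffer, which reinterprets a bytes buffer as unsigned 8-bit integers (np.array itself ended up treating bytes differently, so this is only an illustration of the proposal above):

```python
import numpy as np

# Viewing a bytes object as raw uint8 values, as proposed above:
a = np.frombuffer(b'abc', dtype=np.uint8)
assert a.tolist() == [97, 98, 99]   # the byte values of 'a', 'b', 'c'

# Indexing a Python 3 bytes object likewise yields integers:
assert b'abc'[0] == 97
```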
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
2009/11/27 Christopher Barker chris.bar...@noaa.gov: The point is that I don't think we can just decide to use Unicode or Bytes in all places where PyString was used earlier. Agreed. I only half agree. It seems to me that for almost all situations where PyString was used, the right data type is a python3 string (which is unicode). I realize there may be some few cases where it is appropriate to use bytes, but I think there needs to be a compelling reason for each one. In a way, unicode strings are a bit like arrays: they have an encoding associated with them (like a dtype in numpy). You can represent a given bit of text in multiple different arrangements of bytes, but they are all supposed to mean the same thing and, if you know the encoding, you can convert between them. This is kind of like how one can represent 5 in any of many dtypes: uint8, int16, int32, float32, float64, etc. Not every value represented by one dtype can be converted to all other dtypes, but many can. Just like encodings. This is incorrect. Unicode objects do not have default encodings or multiple internal representations (within a single python interpreter, at least). Unicode objects use 2- or 4-byte representations internally, but this is almost invisible to the user. Encodings only become relevant when you want to convert a unicode object to a byte stream. It is usually an error to store text in a byte stream (for it to make sense you must provide some mechanism to specify the encoding). Anyway, all this brings me to think about the use of strings in numpy in this way: if it is meant to be a human-readable piece of text, it should be a unicode object. If not, then it is bytes. So: fromstring and the like should, of course, work with bytes (though maybe buffers really...) I think if you're going to call it fromstring, it should convert from strings (i.e. unicode strings). But really, I think it makes more sense to rename it frombytes() and have it convert bytes objects. 
One could then have def fromstring(s, encoding='utf-8'): return frombytes(s.encode(encoding)) as a shortcut. Maybe ASCII makes more sense as a default encoding. But really, think about where the user's going to get the string: most of the time it's coming from a disk file or a network stream, so it will be a byte string already, so they should use frombytes. To summarize the use cases I've run across so far: 1) For 'S' dtype, I believe we use Bytes for the raw data and the interface. I don't think so here. 'S' is usually used to store human-readable strings, I'd certainly expect to be able to do: s_array = np.array(['this', 'that'], dtype='S10') And I'd expect it to work with non-literals that were unicode strings, i.e. human-readable text. In fact, it's pretty rare that I'd ever want bytes here. So I'd see 'S' mapped to 'U' here. +1 Francesc Alted wrote: the following should still work: In [2]: s = np.array(['asa'], dtype='S10') In [3]: s[0] Out[3]: 'asa' # will become b'asa' in Python 3 I don't like that -- I put in a string, and get a bytes object back? I agree. In [4]: s.dtype.itemsize Out[4]: 10 # still 1 byte per element But what if the strings passed in aren't representable in one byte per character? Do we define 'S' as supporting ANSI-only strings? What encoding? Itemsize will change. That's fine. 3) Format strings a = array([], dtype=b'i4') I don't think it makes sense to handle format strings in Unicode internally -- they should always be coerced to bytes. This should be fine -- we control what is a valid format string, and thus they can always be ASCII-safe. I have to disagree. Why should we force the user to use bytes? The format strings are just that, strings, and we should be able to supply python strings to them. Keep in mind that coercing strings to bytes requires extra information, namely the encoding. 
If you want to emulate python2's value-dependent coercion -- raise an exception only if non-ASCII is present -- keep in mind that python3 is specifically removing that behaviour because of the problems it caused. Anne
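Anne's frombytes/fromstring split can be sketched as follows (frombytes_sketch and fromstring_sketch are hypothetical names invented here for illustration; np.frombuffer stands in for the byte-level reader):

```python
import numpy as np

def frombytes_sketch(b, dtype=np.uint8):
    # The byte-level primitive: interpret a bytes object as raw data.
    return np.frombuffer(b, dtype=dtype)

def fromstring_sketch(s, dtype=np.uint8, encoding='utf-8'):
    # The text-level shortcut: encode explicitly, then read the bytes.
    return frombytes_sketch(s.encode(encoding), dtype=dtype)

assert fromstring_sketch('abc').tolist() == [97, 98, 99]
```

The key design point is that the encoding appears as an explicit parameter at the text/bytes boundary, instead of being applied implicitly as in Python 2.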
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Anne Archibald wrote: I don't think it makes sense to handle format strings in Unicode internally -- they should always be coerced to bytes. This should be fine -- we control what is a valid format string, and thus they can always be ASCII-safe. I have to disagree. Why should we force the user to use bytes? One of us mis-understood that -- I THINK the idea was that internally numpy would use bytes (for easy conversion to/from char*), but they would get converted, so the user could pass in unicode strings (or bytes). I guess the question remains as to what you'd get when you printed a format string. Keep in mind that coercing strings to bytes requires extra information, namely the encoding. but that is built in to the unicode object. I think the idea is that a format string is ALWAYS ASCII -- if there are any other characters in there, it's an invalid format anyway. Unless I mis-understand what a format string is. I think it's a string you use to represent a custom dtype -- is that right? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception chris.bar...@noaa.gov
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Francesc Alted wrote:
A Friday 27 November 2009 16:41:04 Pauli Virtanen escrigué:

I think so. However, I think 'S' is probably closest to bytes... and maybe 'S' can be reused for bytes... I'm not sure though.

That could be a good idea, because it would ensure compatibility with existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as they should be). The only thing that I don't like is that 'S' seems to be the initial letter for 'string', which is actually 'unicode' in Python 3 :-/ But, for the sake of compatibility, we can probably live with that.

Well, we can deprecate 'S' (i.e. never show it in repr, always only 'B' or 'U').

Well, deprecating 'S' seems a sensible option too. But why only avoid showing it in repr? Why not issue a DeprecationWarning too?

One thing to keep in mind here is that PEP 3118 actually defines a standard dtype format string, which is (mostly) incompatible with NumPy's. It should probably be supported as well when PEP 3118 is implemented. Just something to keep in the back of one's mind when discussing this. For instance one could, instead of inventing something new, adopt the characters PEP 3118 uses (if there isn't a conflict):

- b: raw byte
- c: ucs-1 encoding (latin-1, one byte)
- u: ucs-2 encoding, two bytes
- w: ucs-4 encoding, four bytes

Long-term, I hope the NumPy-specific format string will be deprecated, so that repr prints out the PEP 3118 format string, etc. But I'm aware that API breakage shouldn't happen when porting to Python 3.

-- Dag Sverre
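For reference, Python itself already exposes PEP 3118 format strings through memoryview, so any character choice NumPy makes has an existing convention to line up with. A quick check with plain bytes (the struct module shares the base format characters):

```python
import struct

# memoryview.format is a PEP 3118 format string; a plain bytes object
# exposes itself as unsigned bytes, format character 'B'.
m = memoryview(b'spam')
print(m.format, m.itemsize)

# struct uses the same base characters; 'd' is a C double (8 bytes on
# any platform with IEEE doubles, which is essentially everywhere).
print(struct.calcsize('d'))
```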
[Numpy-discussion] Bytes vs. Unicode in Python3
Hi,

The Python 3 porting needs some decisions on what is Bytes and what is Unicode. I'm currently taking the following approach. Comments?

***

dtype field names

Either Bytes or Unicode. But 'a' and b'a' are *different* fields. The issue is that:

    Python 2: {'a': 2}[u'a'] == 2, {u'a': 2}['a'] == 2
    Python 3: {'a': 2}[b'a'], {b'a': 2}['a'] raise exceptions

so the current assumptions in the C code of u'a' == b'a' cease to hold.

dtype titles

If Bytes or Unicode, work similarly as field names.

dtype format strings, datetime tuple, and any other protocol strings

Bytes. User can pass in Unicode, but it's converted using the UTF-8 codec. This will likely change repr() of various objects. Acceptable?

-- Pauli Virtanen
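Pauli's point that u'a' == b'a' no longer holds can be checked directly: in Python 3, str and bytes never compare equal, so a bytes key simply misses a str-keyed dict entry:

```python
# In Python 3, str and bytes never compare equal, so a field-name dict
# keyed by str cannot be indexed with bytes (and vice versa).
fields = {'a': 2}
assert 'a' in fields
assert b'a' not in fields        # would have matched in Python 2
try:
    fields[b'a']
    hit = True
except KeyError:
    hit = False
print(hit)
```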
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
Hi Pauli,

On Thu, Nov 26, 2009 at 4:08 PM, Pauli Virtanen p...@iki.fi wrote: The Python 3 porting needs some decisions on what is Bytes and what is Unicode. I'm currently taking the following approach. Comments? [...]

I'm not clear on your recommendation here: is it that we should use bytes, with unicode converted to UTF-8? Will that support arrays that have been pickled and such, or will we just have a minimum of code to fix up? And could you expand on the changes that repr() might undergo?

Mind, I think using bytes sounds best, but I haven't looked into the whole strings part of the transition and don't have an informed opinion on the matter.

Chuck
Re: [Numpy-discussion] Bytes vs. Unicode in Python3
On Fri, Nov 27, 2009 at 1:37 AM, Charles R Harris charlesr.har...@gmail.com wrote: I'm not clear on your recommendation here, is it that we should use bytes, with unicode converted to UTF8? [...]

To help clarify for people who are not familiar with Python 3...

To put it simply... in py3, str is unicode -- there is no separate 'unicode' type anymore. bytes is raw data, but kind of like a str type with fewer methods. bytes plus an explicit encoding is how you handle non-UTF-8 text. 'array' exists in both py2 and py3 with a very similar interface on both.
There's a more precise description of strings in Python 3 on these pages:

http://diveintopython3.org/strings.html
http://diveintopython3.org/porting-code-to-python-3-with-2to3.html

How each piece should work depends on its use cases, imho. Mostly, if you are using the str type, then keep using the str type. Many functions take both bytes and strings, since it is sane to work on both from a user's perspective. There have been some methods in the stdlib that have not consumed both; they have been treated as bugs and are being fixed (e.g., some urllib methods).

For dtype, using the Python 'str' by default seems ok, since all of those characters come out in the same manner on both Pythons for the data used by numpy. E.g. 'float32' is shown the same as a py3 string as a py2 string. Internally it is unicode data, however.

Within py2, we save a pickle of the str:

    >>> import pickle
    >>> pickle.dump('float32', open('/tmp/p.pickle', 'wb'))

Within py3, we open the pickle and get the str back:

    >>> import pickle
    >>> pickle.load(open('/tmp/p.pickle', 'rb'))
    'float32'

cheers,
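The pickle example above can be made self-contained with an in-memory buffer instead of a /tmp file; the idea is the same, a str dumped on one side loads back as a str on the other:

```python
import io
import pickle

# Dump the dtype name as a str, then load it back. Protocol 2 is
# readable by both Python 2.3+ and Python 3, which is the scenario
# being discussed (py2 writes, py3 reads).
buf = io.BytesIO()
pickle.dump('float32', buf, protocol=2)
buf.seek(0)
name = pickle.load(buf)
print(name)
```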
Re: [Numpy-discussion] .bytes
Nadav Horesh wrote:

    array(1, dtype=float32).itemsize

ok, it will work fine for my purpose.

In numpy, is there any reason to suppress the attribute .bytes from the type object itself? Is it simply because the native Python types (int, float, complex, etc.) do not have this attribute?

--
 (o o)
oOO--(_)--OOo---
Yves Revaz
Lerma Batiment A          Tel : ++ 33 (0) 1 40 51 20 79
Observatoire de Paris     Fax : ++ 33 (0) 1 40 51 20 02
77 av Denfert-Rochereau   e-mail : [EMAIL PROTECTED]
F-75014 Paris             Web : http://obswww.unige.ch/~revaz/
FRANCE
Re: [Numpy-discussion] .bytes
Yves Revaz wrote: In numpy, is there any reason to suppress the attribute .bytes from the type object itself? Is it simply because the native Python types (int, float, complex, etc.) do not have this attribute? [...]

The problem is that the instances of the scalar types do have the itemsize attribute. The implementation of type objects is such that the type object will also have that attribute, but it will be a stub:

    In [15]: float64.itemsize
    Out[15]: <attribute 'itemsize' of 'numpy.generic' objects>

A more straightforward way to get the itemsize is this:

    In [17]: dtype(float64).itemsize
    Out[17]: 8

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco
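Without numpy at hand, the same sizes can be sanity-checked with the stdlib struct module; numarray's Float32.bytes == 4 corresponds to a C float, and numpy's float64 itemsize of 8 to a C double:

```python
import struct

# C float (numpy float32) is 4 bytes; C double (numpy float64) is 8,
# on any platform with IEEE floating point -- i.e. everywhere numpy runs.
print(struct.calcsize('f'))  # 4
print(struct.calcsize('d'))  # 8
```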
[Numpy-discussion] .bytes
Dear list,

I'm translating code from numarray to numpy. Unfortunately, I'm unable to find the equivalent of the command that gives the number of bytes for a given type. Using numarray I used:

    Float32.bytes
    4

I'm sure there is a solution in numpy, but I'm unable to find it.

Thanks,

Yves
Re: [Numpy-discussion] .bytes
Hi,

In the dtype description, you have itemsize, which is what you want.

Matthieu

2007/10/14, Yves Revaz [EMAIL PROTECTED]: I'm unable to find the equivalent of the command that gives the number of bytes for a given type: using numarray I used Float32.bytes, which gives 4. [...]