Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-09 Thread Francesc Alted
On Sunday 06 December 2009 11:47:23 Francesc Alted wrote:
 On Saturday 05 December 2009 11:16:55 Dag Sverre Seljebotn wrote:
   In [19]: t = np.dtype("i4,f4")
  
   In [20]: t
   Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])
  
   In [21]: hash(t)
   Out[21]: -9041335829180134223
  
   In [22]: t.names = ('one', 'other')
  
   In [23]: t
   Out[23]: dtype([('one', 'i4'), ('other', 'f4')])
  
   In [24]: hash(t)
   Out[24]: 8637734220020415106
  
   Perhaps this should be marked as a bug?  I'm not sure about that,
   because the above seems quite useful.
 
  Well, I for one don't like this, but that's just an opinion. I think it
  is unwise to leave an object which supports hash() mutable, because it's
  too easy to make hard-to-find bugs (sticking a dtype as a key in a dict
  is rather useful in many situations). There's a certain tradition in
  Python for leaving types immutable if possible, and dtype certainly
  feels like it.
 
 Yes, I think you are right and forcing dtype to be immutable would be
  best.

I've filed a ticket so that we don't lose track of this:

http://projects.scipy.org/numpy/ticket/1321

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-06 Thread Francesc Alted
On Saturday 05 December 2009 11:16:55 Dag Sverre Seljebotn wrote:
  Mmh, the only case of dtype *mutability* that I'm aware of is changing
  the names of compound types:
 
  In [19]: t = np.dtype("i4,f4")
 
  In [20]: t
  Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])
 
  In [21]: hash(t)
  Out[21]: -9041335829180134223
 
  In [22]: t.names = ('one', 'other')
 
  In [23]: t
  Out[23]: dtype([('one', 'i4'), ('other', 'f4')])
 
  In [24]: hash(t)
  Out[24]: 8637734220020415106
 
  Perhaps this should be marked as a bug?  I'm not sure about that, because
  the above seems quite useful.
 
 Well, I for one don't like this, but that's just an opinion. I think it
 is unwise to leave an object which supports hash() mutable, because it's
 too easy to make hard-to-find bugs (sticking a dtype as a key in a dict
 is rather useful in many situations). There's a certain tradition in
 Python for leaving types immutable if possible, and dtype certainly
 feels like it.

Yes, I think you are right and forcing dtype to be immutable would be best.
As a bonus, an immutable dtype would render this ticket:

http://projects.scipy.org/numpy/ticket/1127

without effect.

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-05 Thread Dag Sverre Seljebotn
Francesc Alted wrote:
 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:
 Pauli Virtanen wrote:
 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]

 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)
 I regenerate it on each buffer acquisition. It's simple low-level C code,
 and I suspect it will always be fast enough. Of course, we could *cache*
 the result in the dtype. (If dtypes are immutable, which I don't remember
 right now.)
  We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.
 
 Mmh, the only case of dtype *mutability* that I'm aware of is changing the 
 names of compound types:
 
 In [19]: t = np.dtype("i4,f4")
 
 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])
 
 In [21]: hash(t)
 Out[21]: -9041335829180134223
 
 In [22]: t.names = ('one', 'other')
 
 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])
 
 In [24]: hash(t)
 Out[24]: 8637734220020415106
 
 Perhaps this should be marked as a bug?  I'm not sure about that, because the 
 above seems quite useful.

Well, I for one don't like this, but that's just an opinion. I think it 
is unwise to leave an object which supports hash() mutable, because it's 
too easy to make hard-to-find bugs (sticking a dtype as a key in a dict 
is rather useful in many situations). There's a certain tradition in 
Python for leaving types immutable if possible, and dtype certainly 
feels like it.
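
For concreteness, a minimal sketch of the pitfall, assuming a NumPy version 
where hash() takes the field names into account:

    import numpy as np

    t = np.dtype('i4,f4')
    cache = {t: 'expensive result'}   # dtype used as a dictionary key
    t.names = ('one', 'other')        # renaming the fields changes hash(t)
    print(t in cache)                 # False: the stored entry is now unreachable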

Anyway, the buffer PEP can be supported simply by updating the buffer 
format string on the names setter, so it's an orthogonal issue.

BTW note that the buffer PEP provides for supplying names of fields:

T{
  i:one:
  f:other:
}

(or similar). NumPy should probably do so at one point in the future; 
the Cython implementation doesn't because Cython doesn't use this 
information.
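
For reference, a consumer can inspect the exported format string through a 
memoryview once PEP 3118 support is in place; a rough sketch (the exact 
string depends on whether the exporter emits field names):

    import numpy as np

    a = np.zeros(3, dtype=[('one', 'i4'), ('other', 'f4')])
    m = memoryview(a)           # buffer acquisition via PEP 3118 (Python 3)
    print(m.format)             # e.g. 'T{i:one:f:other:}' with names exported
    print(m.itemsize, m.ndim)   # 8 1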

-- 
Dag Sverre


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-05 Thread David Cournapeau
On Sat, Dec 5, 2009 at 7:16 PM, Dag Sverre Seljebotn
da...@student.matnat.uio.no wrote:

 Perhaps this should be marked as a bug?  I'm not sure about that, because the
 above seems quite useful.

 Well, I for one don't like this, but that's just an opinion. I think it
 is unwise to leave an object which supports hash() mutable, because it's
 too easy to make hard-to-find bugs (sticking a dtype as a key in a dict
 is rather useful in many situations). There's a certain tradition in
 Python for leaving types immutable if possible, and dtype certainly
 feels like it.

I agree the behavior is a bit surprising, but I don't know if any code
out there relies on compound dtype names being immutable. Also, the
fact that the names attribute is a tuple and not a list suggests that
the intent is to be immutable.

I am more worried about the variation between Python versions ATM,
though; I have no idea where it is coming from.

David


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread Francesc Alted
On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:
 Pauli Virtanen wrote:
  Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
  [clip]
 
  Great! Are you storing the format string in the dtype types as well? (So
  that no release is needed and acquisitions are cheap...)
 
  I regenerate it on each buffer acquisition. It's simple low-level C code,
  and I suspect it will always be fast enough. Of course, we could *cache*
  the result in the dtype. (If dtypes are immutable, which I don't remember
  right now.)
 
 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.

Mmh, the only case of dtype *mutability* that I'm aware of is changing the 
names of compound types:

In [19]: t = np.dtype("i4,f4")

In [20]: t
Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

In [21]: hash(t)
Out[21]: -9041335829180134223

In [22]: t.names = ('one', 'other')

In [23]: t
Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

In [24]: hash(t)
Out[24]: 8637734220020415106

Perhaps this should be marked as a bug?  I'm not sure about that, because the 
above seems quite useful.

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread David Cournapeau
On Fri, Dec 4, 2009 at 9:23 PM, Francesc Alted fal...@pytables.org wrote:
 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:
 Pauli Virtanen wrote:
  Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
  [clip]
 
  Great! Are you storing the format string in the dtype types as well? (So
  that no release is needed and acquisitions are cheap...)
 
  I regenerate it on each buffer acquisition. It's simple low-level C code,
  and I suspect it will always be fast enough. Of course, we could *cache*
  the result in the dtype. (If dtypes are immutable, which I don't remember
  right now.)

 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.

 Mmh, the only case of dtype *mutability* that I'm aware of is changing the
 names of compound types:

 In [19]: t = np.dtype("i4,f4")

 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

 In [21]: hash(t)
 Out[21]: -9041335829180134223

 In [22]: t.names = ('one', 'other')

 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

 In [24]: hash(t)
 Out[24]: 8637734220020415106

 Perhaps this should be marked as a bug?  I'm not sure about that, because the
 above seems quite useful.

Hm, that's strange - I get the same hash in both cases, but I thought
I took into account names when I implemented the hashing protocol for
dtype. Which version of numpy on which OS are you seeing this?

David


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread Bruce Southey
On 12/04/2009 10:12 AM, David Cournapeau wrote:
 On Fri, Dec 4, 2009 at 9:23 PM, Francesc Altedfal...@pytables.org  wrote:

 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:
  
 Pauli Virtanen wrote:

 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]

  
 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)

 I regenerate it on each buffer acquisition. It's simple low-level C code,
 and I suspect it will always be fast enough. Of course, we could *cache*
 the result in the dtype. (If dtypes are immutable, which I don't remember
 right now.)
  
 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.

 Mmh, the only case of dtype *mutability* that I'm aware of is changing the
 names of compound types:

 In [19]: t = np.dtype("i4,f4")

 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

 In [21]: hash(t)
 Out[21]: -9041335829180134223

 In [22]: t.names = ('one', 'other')

 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

 In [24]: hash(t)
 Out[24]: 8637734220020415106

 Perhaps this should be marked as a bug?  I'm not sure about that, because the
 above seems quite useful.
  
 Hm, that's strange - I get the same hash in both cases, but I thought
 I took into account names when I implemented the hashing protocol for
 dtype. Which version of numpy on which os are you seeing this ?

 David

Hi,
On the same linux 64-bit Fedora 11, I get the same hash with Python2.4 
and numpy 1.3 but different hashes for Python2.6 and numpy 1.4.

Bruce

Python 2.6 (r26:66714, Jun  8 2009, 16:07:29)
[GCC 4.4.0 20090506 (Red Hat 4.4.0-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.4.0.dev7750'
>>> t = np.dtype("i4,f4")
>>> t
dtype([('f0', 'i4'), ('f1', 'f4')])
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> t
dtype([('one', 'i4'), ('other', 'f4')])
>>> hash(t)
8637734220020415106


Python 2.4.5 (#1, Oct  6 2008, 09:54:35)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.3.0.dev6653'
>>> t = np.dtype("i4,f4")
>>> hash(t)
140053539914640
>>> t.names = ('one', 'other')
>>> hash(t)
140053539914640





Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread David Cournapeau
On Sat, Dec 5, 2009 at 1:31 AM, Bruce Southey bsout...@gmail.com wrote:
 On 12/04/2009 10:12 AM, David Cournapeau wrote:
 On Fri, Dec 4, 2009 at 9:23 PM, Francesc Altedfal...@pytables.org  wrote:

 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:

 Pauli Virtanen wrote:

 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]


 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)

 I regenerate it on each buffer acquisition. It's simple low-level C code,
 and I suspect it will always be fast enough. Of course, we could *cache*
 the result in the dtype. (If dtypes are immutable, which I don't remember
 right now.)

 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.

 Mmh, the only case of dtype *mutability* that I'm aware of is changing the
 names of compound types:

 In [19]: t = np.dtype("i4,f4")

 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

 In [21]: hash(t)
 Out[21]: -9041335829180134223

 In [22]: t.names = ('one', 'other')

 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

 In [24]: hash(t)
 Out[24]: 8637734220020415106

 Perhaps this should be marked as a bug?  I'm not sure about that, because 
 the
 above seems quite useful.

 Hm, that's strange - I get the same hash in both cases, but I thought
 I took into account names when I implemented the hashing protocol for
 dtype. Which version of numpy on which os are you seeing this ?

 David

 Hi,
 On the same linux 64-bit Fedora 11, I get the same hash with Python2.4
 and numpy 1.3 but different hashes for Python2.6 and numpy 1.4.

Could you check the behavior of 1.4.0 on 2.4?  The code doing hashing
for dtypes has not changed since 1.3.0, so normally only the Python version
should have an influence.

David


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread Francesc Alted
On Friday 04 December 2009 17:12:09 David Cournapeau wrote:
  Mmh, the only case of dtype *mutability* that I'm aware of is changing
  the names of compound types:
 
  In [19]: t = np.dtype("i4,f4")
 
  In [20]: t
  Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])
 
  In [21]: hash(t)
  Out[21]: -9041335829180134223
 
  In [22]: t.names = ('one', 'other')
 
  In [23]: t
  Out[23]: dtype([('one', 'i4'), ('other', 'f4')])
 
  In [24]: hash(t)
  Out[24]: 8637734220020415106
 
  Perhaps this should be marked as a bug?  I'm not sure about that, because
  the above seems quite useful.
 
 Hm, that's strange - I get the same hash in both cases, but I thought
 I took into account names when I implemented the hashing protocol for
 dtype. Which version of numpy on which os are you seeing this ?

numpy: 1.4.0.dev7072
python: 2.6.1

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread David Cournapeau
On Sat, Dec 5, 2009 at 1:57 AM, David Cournapeau courn...@gmail.com wrote:
 On Sat, Dec 5, 2009 at 1:31 AM, Bruce Southey bsout...@gmail.com wrote:
 On 12/04/2009 10:12 AM, David Cournapeau wrote:
 On Fri, Dec 4, 2009 at 9:23 PM, Francesc Altedfal...@pytables.org  wrote:

 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:

 Pauli Virtanen wrote:

 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]


 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)

 I regenerate it on each buffer acquisition. It's simple low-level C code,
 and I suspect it will always be fast enough. Of course, we could *cache*
 the result in the dtype. (If dtypes are immutable, which I don't remember
 right now.)

 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.

 Mmh, the only case of dtype *mutability* that I'm aware of is changing the
 names of compound types:

 In [19]: t = np.dtype("i4,f4")

 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

 In [21]: hash(t)
 Out[21]: -9041335829180134223

 In [22]: t.names = ('one', 'other')

 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

 In [24]: hash(t)
 Out[24]: 8637734220020415106

 Perhaps this should be marked as a bug?  I'm not sure about that, because 
 the
 above seems quite useful.

 Hm, that's strange - I get the same hash in both cases, but I thought
 I took into account names when I implemented the hashing protocol for
 dtype. Which version of numpy on which os are you seeing this ?

 David

 Hi,
 On the same linux 64-bit Fedora 11, I get the same hash with Python2.4
 and numpy 1.3 but different hashes for Python2.6 and numpy 1.4.

 Could you check the behavior of 1.4.0 on 2.4 ? The code doing hashing
 for dtypes has not changed since 1.3.0, so normally only the python
 should have an influence.

When I say "should", it should be understood as: this is the only reason
why I think it could be different - the behavior should certainly not
depend on the Python version.

David


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-04 Thread Bruce Southey
On 12/04/2009 10:57 AM, David Cournapeau wrote:
 On Sat, Dec 5, 2009 at 1:31 AM, Bruce Southeybsout...@gmail.com  wrote:

 On 12/04/2009 10:12 AM, David Cournapeau wrote:
  
 On Fri, Dec 4, 2009 at 9:23 PM, Francesc Altedfal...@pytables.org
 wrote:


 On Thursday 03 December 2009 14:56:16 Dag Sverre Seljebotn wrote:

  
 Pauli Virtanen wrote:


 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]


  
 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)


 I regenerate it on each buffer acquisition. It's simple low-level C code,
 and I suspect it will always be fast enough. Of course, we could *cache*
 the result in the dtype. (If dtypes are immutable, which I don't remember
 right now.)

  
 We discussed this at SciPy 09 -- basically, they are not necessarily
 immutable in implementation, but anywhere they are not that is a bug and
 no code should depend on their mutability, so we are free to assume so.


 Mmh, the only case of dtype *mutability* that I'm aware of is changing the
 names of compound types:

 In [19]: t = np.dtype("i4,f4")

 In [20]: t
 Out[20]: dtype([('f0', 'i4'), ('f1', 'f4')])

 In [21]: hash(t)
 Out[21]: -9041335829180134223

 In [22]: t.names = ('one', 'other')

 In [23]: t
 Out[23]: dtype([('one', 'i4'), ('other', 'f4')])

 In [24]: hash(t)
 Out[24]: 8637734220020415106

 Perhaps this should be marked as a bug?  I'm not sure about that, because 
 the
 above seems quite useful.

  
 Hm, that's strange - I get the same hash in both cases, but I thought
 I took into account names when I implemented the hashing protocol for
 dtype. Which version of numpy on which os are you seeing this ?

 David


 Hi,
 On the same linux 64-bit Fedora 11, I get the same hash with Python2.4
 and numpy 1.3 but different hashes for Python2.6 and numpy 1.4.
  
 Could you check the behavior of 1.4.0 on 2.4 ? The code doing hashing
 for dtypes has not changed since 1.3.0, so normally only the python
 should have an influence.

 David

These are different with Python 2.4 and numpy 1.4. Curiously I got 
different hash values with Python 2.5 and numpy 1.3. (For what it is 
worth, I get the same hash values with Python 2.3 with numpy 1.1.1).

Bruce


Python 2.5.2 (r252:60911, Nov 18 2008, 09:20:42)
[GCC 4.3.2 20081105 (Red Hat 4.3.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.3.0'
>>> t = np.dtype("i4,f4")
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> hash(t)
8637734220020415106


[bsout...@starling python]$ /usr/local/bin/python2.4
Python 2.4.5 (#1, Oct  6 2008, 09:54:35)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.4.0rc1'
>>> t = np.dtype("i4,f4")
>>> hash(t)
-9041335829180134223
>>> t.names = ('one', 'other')
>>> hash(t)
8637734220020415106

Python 2.3.7 (#1, Oct  6 2008, 09:55:54)
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.1.1'
>>> t = np.dtype("i4,f4")
>>> hash(t)
140552637936672
>>> t.names = ('one', 'other')
>>> hash(t)
140552637936672



Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-03 Thread Dag Sverre Seljebotn
Dag Sverre Seljebotn wrote:
 Dag Sverre Seljebotn wrote:
   
 Pauli Virtanen wrote:
   
 
 Fri, 27 Nov 2009 23:19:58 +0100, Dag Sverre Seljebotn wrote:
 [clip]
   
 
   
 One thing to keep in mind here is that PEP 3118 actually defines a
 standard dtype format string, which is (mostly) incompatible with
 NumPy's. It should probably be supported as well when PEP 3118 is
 implemented.
 
   
 
 PEP 3118 is for the most part implemented in my Py3K branch now -- it was 
 not actually much work, as I could steal most of the format string 
 converter from numpy.pxd.
   
 
   
 Great! Are you storing the format string in the dtype types as well? (So 
 that no release is needed and acquisitions are cheap...)

 As far as numpy.pxd goes -- well, for the simplest dtypes.
   
 
 Some questions:

 How hard do we want to try supplying a buffer? Eg. if the consumer does 
 not specify strided but specifies suboffsets, should we try to compute 
 suitable suboffsets? Should we try making contiguous copies of the data 
 (I guess this would break buffer semantics?)?
   
 
   
 Actually per the PEP, suboffsets imply strided:

 #define PyBUF_INDIRECT (0x0100 | PyBUF_STRIDES)

 :-) So there's no real way for a consumer to specify only suboffsets, 
 0x0100 is not a possible flag I think. Suboffsets can't really work 
 without the strides anyway IIUC, and in the case of NumPy the field can 
 always be left at 0.
   
 
 That is, NULL!
   
 IMO one should very much stay clear of making contiguous copies, 
  especially considering the existence of PyBuffer_ToContiguous, which 
 makes it trivial for client code to get a pointer to a contiguous buffer 
 anyway. The intention of the PEP seems to be to export the buffer in as 
 raw form as possible.

  Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too 
 conservative with NumPy arrays. If a contiguous buffer is requested, 
 then  looping through the strides and checking that the strides are 
 monotonically decreasing/increasing could eventually save copying in 
 some cases. I think that could be worth it -- I actually have my own 
   
 
 And, of course, that the innermost stride is 1.
   
Aargh. Some day I'll find/implement a 10 minute send delay for my email 
program, so I'll catch my errors before the emails go out...

Anyway, this is not sufficient, one must also check correspondence with 
the shape, of course.
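
Put together, the full check being described (innermost stride of one item, 
strides growing monotonically, and agreement with the shape) might look 
roughly like this sketch for the C-ordered case:

    import numpy as np

    def is_c_contiguous(shape, strides, itemsize):
        # walk from the innermost axis outwards; each stride must equal the
        # number of bytes spanned by all faster-varying axes
        expected = itemsize
        for dim, stride in zip(reversed(shape), reversed(strides)):
            if dim != 1 and stride != expected:
                return False
            expected *= dim
        return True

    a = np.arange(10)[None, None, :]
    print(is_c_contiguous(a.shape, a.strides, a.itemsize))  # True
    # a.flags.c_contiguous reported False here on the NumPy of the time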

Dag Sverre

 code for IS_F_CONTIGUOUS rather than relying on the flags personally 
 because of this issue, so it does come up in practice.

 Dag Sverre
   
 

   



Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-03 Thread Pauli Virtanen
Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
[clip]
 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)

I regenerate it on each buffer acquisition. It's simple low-level C code, 
and I suspect it will always be fast enough. Of course, we could *cache* 
the result in the dtype. (If dtypes are immutable, which I don't remember 
right now.)

Do you have a case in mind where the speed of format string generation 
would be a bottleneck?

 Some questions:

 How hard do we want to try supplying a buffer? Eg. if the consumer does
 not specify strided but specifies suboffsets, should we try to compute
 suitable suboffsets? Should we try making contiguous copies of the data
 (I guess this would break buffer semantics?)?
   
 Actually per the PEP, suboffsets imply strided:
 
 #define PyBUF_INDIRECT (0x0100 | PyBUF_STRIDES)
 
 :-) So there's no real way for a consumer to specify only suboffsets,
 0x0100 is not a possible flag I think. Suboffsets can't really work
 without the strides anyway IIUC, and in the case of NumPy the field can
 always be left at 0.

Ok, great!

 IMO one should very much stay clear of making contiguous copies,
  especially considering the existence of PyBuffer_ToContiguous, which
 makes it trivial for client code to get a pointer to a contiguous buffer
 anyway. The intention of the PEP seems to be to export the buffer in as
 raw form as possible.

This is what I thought, too.

  Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
 conservative with NumPy arrays. If a contiguous buffer is requested,
 then  looping through the strides and checking that the strides are
 monotonically decreasing/increasing could eventually save copying in
 some cases. I think that could be worth it -- I actually have my own
 code for IS_F_CONTIGUOUS rather than relying on the flags personally
 because of this issue, so it does come up in practice.

Are you sure?

Assume monotonically increasing or decreasing strides with inner stride 
of itemsize. Now, if the strides are not C or F-contiguous, doesn't this 
imply that part of the data in the memory block is *not* pointed to by a 
set of indices? [For example, strides = {itemsize, 3*itemsize}; dims = 
{2, 2}. Now, there is unused memory between items (1,0) and (0,1).]
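
That example can be built directly with as_strided; a small illustration 
(itemsize 4, so strides (4, 12) for dims (2, 2)):

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    base = np.arange(6, dtype=np.int32)                 # itemsize == 4
    a = as_strided(base, shape=(2, 2), strides=(4, 12))
    print(a)                                            # [[0 3], [1 4]] -- base[2] is never reached
    print(a.flags.c_contiguous, a.flags.f_contiguous)   # False False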

This probably boils down to what exactly was meant in the PEP and Python 
docs by contiguous. I'd believe it was meant to be the same as in Numpy 
-- that you can send the array data e.g. to Fortran as-is. If so, there 
should not be gaps in the data, if the client explicitly requested that 
the buffer be contiguous.

Maybe you meant that the Numpy array flags (which the macros check) are 
not always up-to-date wrt. the stride information?

-- 
Pauli Virtanen



Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-12-03 Thread Dag Sverre Seljebotn
Pauli Virtanen wrote:
 Thu, 03 Dec 2009 14:03:13 +0100, Dag Sverre Seljebotn wrote:
 [clip]
   
 Great! Are you storing the format string in the dtype types as well? (So
 that no release is needed and acquisitions are cheap...)
 

 I regenerate it on each buffer acquisition. It's simple low-level C code, 
 and I suspect it will always be fast enough. Of course, we could *cache* 
 the result in the dtype. (If dtypes are immutable, which I don't remember 
 right now.)
   
We discussed this at SciPy 09 -- basically, they are not necessarily 
immutable in implementation, but anywhere they are not that is a bug and 
no code should depend on their mutability, so we are free to assume so.

 Do you have a case in mind where the speed of format string generation 
 would be a bottleneck?
   
Going all the way down to user code; no. Well, contrived: You have a 
Python list of NumPy arrays and want to sum over the first element of 
each, acquiring the buffer by PEP 3118 (which is easy through Cython). 
In that case I can see all the memory allocation that must go on for 
 each element for the format-string as a bottleneck.

But mostly it's from cleanliness of implementation, like the fact that 
 you don't know up-front how long the string needs to be for nested dtypes.

Obviously, what you have done is much better than nothing, and probably 
sufficient for nearly all purposes, so I should stop complaining.

  Do keep in mind that IS_C_CONTIGUOUS and IS_F_CONTIGUOUS can be too
 conservative with NumPy arrays. If a contiguous buffer is requested,
 then  looping through the strides and checking that the strides are
 monotonically decreasing/increasing could eventually save copying in
 some cases. I think that could be worth it -- I actually have my own
 code for IS_F_CONTIGUOUS rather than relying on the flags personally
 because of this issue, so it does come up in practice.
 

 Are you sure?

 Assume monotonically increasing or decreasing strides with inner stride 
 of itemsize. Now, if the strides are not C or F-contiguous, doesn't this 
 imply that part of the data in the memory block is *not* pointed to by a 
 set of indices? [For example, strides = {itemsize, 3*itemsize}; dims = 
 {2, 2}. Now, there is unused memory between items (1,0) and (0,1).]

 This probably boils down to what exactly was meant in the PEP and Python 
 docs by contiguous. I'd believe it was meant to be the same as in Numpy 
 -- that you can send the array data e.g. to Fortran as-is. If so, there 
 should not be gaps in the data, if the client explicitly requested that 
 the buffer be contiguous.

 Maybe you meant that the Numpy array flags (which the macros check) are 
 not always up-to-date wrt. the stride information?
   
Yep, this is what I meant, and the rest is wrong. But now that I think 
about it, the case that bit me is

In [14]: np.arange(10)[None, None, :].flags.c_contiguous
Out[14]: False

I suppose this particular case could be fixed properly with little cost 
(if it isn't already). It is probably cleaner to just rely on the flags 
for PEP 3118, less confusion etc. Sorry for the distraction.

Dag Sverre


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
On Thu, 2009-11-26 at 17:37 -0700, Charles R Harris wrote:
[clip]
 I'm not clear on your recommendation here, is it that we should use
 bytes, with unicode converted to UTF8?

The point is that I don't think we can just decide to use Unicode or
Bytes in all places where PyString was used earlier. Which one it will
be should depend on the use. Users will expect that eg. array([1,2,3],
dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
dtype=b'f4').

To summarize the use cases I've run across so far:

1) For 'S' dtype, I believe we use Bytes for the raw data and the
   interface.

   Maybe we want to introduce a separate bytes dtype that's an alias
   for 'S'?

2) The field names:

a = array([], dtype=[('a', int)])
a = array([], dtype=[(b'a', int)])

This is somewhat of an internal issue. We need to decide whether we
internally coerce input to Unicode or Bytes. Or whether we allow for
both Unicode and Bytes (but preserving previous semantics in this case
requires extra work, due to semantic changes in PyDict).

Currently, there's some code in Numpy to allow for Unicode field names,
but it's not been coherently implemented in all places, so e.g. direct
creation of dtypes with unicode field names fails.

This has also implications on field titles, as also those are stored in
the fields dict.

3) Format strings

a = array([], dtype=b'i4')

I don't think it makes sense to handle format strings in Unicode
internally -- they should always be coerced to bytes. This will make it
easier at many points, since it will be enough to do

PyBytes_AS_STRING(str)

to get the char* pointer, rather than having to encode to utf-8 first.
Same for all other similar uses of string, e.g. protocol descriptors.
User input should just be coerced to ASCII on input, I believe.

The problem here is that preserving repr() in this case requires some
extra work. But maybe that has to be done.
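
At the Python level, the coercion rule for format strings amounts to 
something like this sketch (the helper name is made up):

    def asbytes_format(s):
        # normalize a format/protocol string to ASCII bytes, accepting
        # either str or bytes from the caller
        if isinstance(s, bytes):
            return s
        return s.encode('ascii')

    # both spellings then reach the C level as b'f4'
    print(asbytes_format('f4'), asbytes_format(b'f4'))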

 Will that support arrays that have been pickled and such?

Are the pickles backward compatible between Python 2 and 3 at all?
I think using Bytes for format strings will be backward-compatible.

Field names are then a bit more difficult. Actually, we'll probably just
have to coerce them to either Bytes or Unicode internally, since we'll
need to do that on unpickling if we want to be backward-compatible.

 Or will we just have a minimum of code to fix up?

I think we will need in any case to replace all use of PyString in Numpy
by PyBytes or PyUnicode, depending on context, and #define PyString
PyBytes for Python 2.

This seems to be the easiest way to make sure we have fixed all points
that need fixing.

Currently, 193 of 800 numpy.core tests don't pass, and this seems
largely due to Bytes vs. Unicode issues.

 And could you expand on the changes that repr() might undergo?

The main thing is that

dtype('i4')
dtype([('a', 'i4')])

may become

dtype(b'i4')
dtype([(b'a', b'i4')])

Of course, we can write and #ifdef separate repr formatting code for
Py3, but this is a bit of extra work.

 Mind, I think using bytes sounds best, but I haven't looked into the
 whole strings part of the transition and don't have an informed
 opinion on the matter.

***

By the way, should I commit this stuff (after factoring the commits to
logical chunks) to SVN?

It does not break anything for Python 2, at least as far as the test
suite is concerned.

Pauli




Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread David Cournapeau
Pauli Virtanen wrote:
 By the way, should I commit this stuff (after factoring the commits to
 logical chunks) to SVN?
   

I would prefer getting at least one py3 buildbot before doing anything
significant,

cheers,

David


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
On Fri, 2009-11-27 at 18:30 +0900, David Cournapeau wrote:
 Pauli Virtanen wrote:
  By the way, should I commit this stuff (after factoring the commits to
  logical chunks) to SVN?

 I would prefer getting at least one py3 buildbot before doing anything
 significant,

I can add it to mine:
http://buildbot.scipy.org/builders/Linux_x86_Ubuntu/builds/279/steps/shell_1/logs/stdio

It already does 2.4, 2.5 and 2.6.

Pauli




Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Francesc Alted
On Friday 27 November 2009 10:47:53 Pauli Virtanen wrote:
 1) For 'S' dtype, I believe we use Bytes for the raw data and the
interface.
 
Maybe we want to introduce a separate bytes dtype that's an alias
for 'S'?

Yeah.  As regular strings in Python 3 are Unicode, I think that introducing a 
separate bytes dtype would help with the transition.  Meanwhile, the following 
should still work:

In [2]: s = np.array(['asa'], dtype="S10")

In [3]: s[0]
Out[3]: 'asa'  # will become b'asa' in Python 3

In [4]: s.dtype.itemsize
Out[4]: 10 # still 1-byte per element

Also, I suppose that there will be issues with the current Unicode support in 
NumPy:

In [5]: u = np.array(['asa'], dtype="U10")

In [6]: u[0]
Out[6]: u'asa'  # will become 'asa' in Python 3

In [7]: u.dtype.itemsize
Out[7]: 40  # not sure about the size in Python 3

For example, if it is true that internal strings in Python 3 are Unicode UTF-8 
(as René seems to suggest), I suppose that the internal conversions from 2-
bytes or 4-bytes (depending on how the Python interpreter has been compiled) 
in NumPy Unicode dtype to the new Python string should have to be reworked 
(perhaps you have dealt with that already).

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
On Fri, 2009-11-27 at 11:17 +0100, Francesc Alted wrote:
 On Friday 27 November 2009 10:47:53 Pauli Virtanen wrote:
  1) For 'S' dtype, I believe we use Bytes for the raw data and the
 interface.
  
 Maybe we want to introduce a separate bytes dtype that's an alias
 for 'S'?
 
 Yeah.  As regular strings in Python 3 are Unicode, I think that introducing 
 separate bytes dtype would help doing the transition.  Meanwhile, the next 
 should still work:
 
 In [2]: s = np.array(['asa'], dtype="S10")
 
 In [3]: s[0]
 Out[3]: 'asa'  # will become b'asa' in Python 3
 
 In [4]: s.dtype.itemsize
 Out[4]: 10 # still 1-byte per element

Yes. But now I wonder, should

array(['foo'], str)
array(['foo'])

be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
which will mean unavoidable code breakage -- there's probably no
avoiding it.
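
For concreteness, the 'U' choice means the following on Python 3 (a sketch of 
the behaviour being proposed, not something already settled here):

    import numpy as np

    print(np.array(['foo']).dtype)    # <U3 -- str input maps to the 'U' kind
    print(np.array([b'foo']).dtype)   # |S3 -- bytes input keeps the 'S' kind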

[clip]
 Also, I suppose that there will be issues with the current Unicode support in 
 NumPy:
 
 In [5]: u = np.array(['asa'], dtype="U10")
 
 In [6]: u[0]
 Out[6]: u'asa'  # will become 'asa' in Python 3
 
 In [7]: u.dtype.itemsize
 Out[7]: 40  # not sure about the size in Python 3

I suspect the Unicode stuff will keep working without major changes,
except maybe dropping the u in repr. It is difficult to believe the
CPython guys would have significantly changed the current Unicode
implementation, if they didn't bother changing the names of the
functions :)

 For example, if it is true that internal strings in Python 3 and Unicode 
 UTF-8 
 (as René seems to suggest), I suppose that the internal conversions from 2-
 bytes or 4-bytes (depending on how the Python interpreter has been compiled) 
 in NumPy Unicode dtype to the new Python string should have to be reworked 
 (perhaps you have dealt with that already).

I don't think they are internally UTF-8:
http://docs.python.org/3.1/c-api/unicode.html

Python’s default builds use a 16-bit type for Py_UNICODE and store
Unicode values internally as UCS2.

-- 
Pauli Virtanen




Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Francesc Alted
On Friday 27 November 2009 11:27:00 Pauli Virtanen wrote:
 Yes. But now I wonder, should
 
   array(['foo'], str)
   array(['foo'])
 
 be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
 which will mean unavoidable code breakage -- there's probably no
 avoiding it.

Mmh, you are right.  Yes, this seems to be difficult to solve.  Well, I'm 
changing my mind and think that both 'str' and 'S' should stand for Unicode in 
NumPy for Python 3.  If people are aware of the change for Python 3, they 
should be expecting the same change happening in NumPy too, I guess.  Then, I 
suppose that a new 'bytes' dtype that replaces the existing 'string' one would 
be absolutely necessary.

  Also, I suppose that there will be issues with the current Unicode
  support in NumPy:
 
  In [5]: u = np.array(['asa'], dtype="U10")
 
  In [6]: u[0]
  Out[6]: u'asa'  # will become 'asa' in Python 3
 
  In [7]: u.dtype.itemsize
  Out[7]: 40  # not sure about the size in Python 3
 
 I suspect the Unicode stuff will keep working without major changes,
 except maybe dropping the u in repr. It is difficult to believe the
 CPython guys would have significantly changed the current Unicode
 implementation, if they didn't bother changing the names of the
 functions :)
 
  For example, if it is true that internal strings in Python 3 and Unicode
  UTF-8 (as René seems to suggest), I suppose that the internal conversions
  from 2- bytes or 4-bytes (depending on how the Python interpreter has
  been compiled) in NumPy Unicode dtype to the new Python string should
  have to be reworked (perhaps you have dealt with that already).
 
 I don't think they are internally UTF-8:
 http://docs.python.org/3.1/c-api/unicode.html
 
 Python’s default builds use a 16-bit type for Py_UNICODE and store
 Unicode values internally as UCS2.

Ah!  No changes for that matter.  Much better then.

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread René Dudfield
On Fri, Nov 27, 2009 at 11:50 AM, Francesc Alted fal...@pytables.org wrote:
 On Friday 27 November 2009 11:27:00 Pauli Virtanen wrote:
 Yes. But now I wonder, should

       array(['foo'], str)
       array(['foo'])

 be of dtype 'S' or 'U' in Python 3? I think I'm leaning towards 'U',
 which will mean unavoidable code breakage -- there's probably no
 avoiding it.

 Mmh, you are right.  Yes, this seems to be difficult to solve.  Well, I'm
 changing my mind and think that both 'str' and 'S' should stand for Unicode in
 NumPy for Python 3.  If people is aware of the change for Python 3, they
 should be expecting the same change happening in NumPy too, I guess.  Then, I
 suppose that a new dtype bytes that replaces the existing string would be
 absolutely necessary.

  Also, I suppose that there will be issues with the current Unicode
  support in NumPy:
 
  In [5]: u = np.array(['asa'], dtype="U10")
 
  In [6]: u[0]
  Out[6]: u'asa'  # will become 'asa' in Python 3
 
  In [7]: u.dtype.itemsize
  Out[7]: 40      # not sure about the size in Python 3

 I suspect the Unicode stuff will keep working without major changes,
 except maybe dropping the u in repr. It is difficult to believe the
 CPython guys would have significantly changed the current Unicode
 implementation, if they didn't bother changing the names of the
 functions :)

  For example, if it is true that internal strings in Python 3 and Unicode
  UTF-8 (as René seems to suggest), I suppose that the internal conversions
  from 2- bytes or 4-bytes (depending on how the Python interpreter has
  been compiled) in NumPy Unicode dtype to the new Python string should
  have to be reworked (perhaps you have dealt with that already).

 I don't think they are internally UTF-8:
 http://docs.python.org/3.1/c-api/unicode.html

 Python’s default builds use a 16-bit type for Py_UNICODE and store
 Unicode values internally as UCS2.

 Ah!  No changes for that matter.  Much better then.



Hello,


in py3...

>>> 'Hello\u0020World !'.encode()
b'Hello World !'
>>> "Äpfel".encode('utf-8')
b'\xc3\x84pfel'
>>> "Äpfel".encode()
b'\xc3\x84pfel'

The default encoding does appear to be utf-8 in py3.

Although it is compiled with something different, and stores it as
something different, that is UCS2 or UCS4.

I imagine dtype 'S' and 'U' need more clarification.  They seem to miss the
concept of encodings?  Currently, 'S' appears to mean 8-bit characters with
no encoding, and 'U' appears to mean 16-bit characters with no encoding?  Or
are some sort of default encodings assumed?

2to3/3to2 fixers will probably have to be written for users' code
here... whatever is decided.  At least warnings should be generated
I'm guessing.


btw, in my numpy tree there is a unicode_() alias to str in py3, and
to unicode in py2 (inside the compat.py file).  This helped us in many
cases with compatible string code in the pygame port.  This allows you
to create unicode strings on both platforms with the same code.
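
The alias is essentially this (a sketch of the idea; the actual compat.py 
may differ):

    import sys

    if sys.version_info[0] >= 3:
        unicode_ = str        # the text type on Python 3
    else:
        unicode_ = unicode    # the text type on Python 2

    # unicode_('foo') now builds a text string under both interpreters
    print(unicode_('foo'))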



cheers,


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
On Fri, 2009-11-27 at 13:23 +0100, René Dudfield wrote:
[clip]
 I imagine dtype 'S' and 'U' need more clarification.  As it misses the
 concept of encodings it seems?  Currently, S appears to mean 8bit
 characters no encoding, and U appears to mean 16bit characters no
 encoding?  Or are some sort of default encodings assumed?

Currently in Numpy in Python 2, 'S' is the same as Python 3 bytes, and 'U'
is the same as Python 3 unicode, probably in the same internal representation
(need to check). Neither is associated with encoding info.

We probably need to change the meaning of 'S', as Francesc noted, and
add a separate bytes dtype.
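
(For reference, and as later behaviour rather than anything decided in this 
message: NumPy did grow a bytes_ scalar type, and on Python 3 the builtin 
types map onto the existing kinds like this.)

    import numpy as np

    print(np.dtype(bytes).kind, np.dtype(str).kind)   # S U
    print(np.dtype(np.bytes_).kind)                   # S -- bytes_ is the 'S' scalar type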

 2to3/3to2 fixers will probably have to be written for users code
 here... whatever is decided.  At least warnings should be generated
 I'm guessing.

Possibly. Does 2to3 support plugins? If yes, it could be possible to
write one.

 btw, in my numpy tree there is a unicode_() alias to str in py3, and
 to unicode in py2 (inside the compat.py file).  This helped us in many
 cases with compatible string code in the pygame port.  This allows you
 to create unicode strings on both platforms with the same code.

Yes, I saw that. The name unicode_ is however already taken by the Numpy
scalar type, so we need to think of a different name for it. asstring,
maybe.

Btw, do you want to rebase your distutils changes on top of my tree? I
tried yours out quickly, but there were some issues there that prevented
distutils from working. (Also, you can use absolute imports both for
Python 2 and 3 -- there's probably no need to use relative imports.)

Pauli




Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Francesc Alted
On Friday 27 November 2009 13:23:10 René Dudfield wrote:
  I don't think they are internally UTF-8:
  http://docs.python.org/3.1/c-api/unicode.html
 
  Python’s default builds use a 16-bit type for Py_UNICODE and store
  Unicode values internally as UCS2.
 
  Ah!  No changes for that matter.  Much better then.
 
 Hello,
 
 
 in py3...
 
  >>> 'Hello\u0020World !'.encode()
  b'Hello World !'
  >>> "Äpfel".encode('utf-8')
  b'\xc3\x84pfel'
  >>> "Äpfel".encode()
  b'\xc3\x84pfel'
 
 The default encoding does appear to be utf-8 in py3.
 
 Although it is compiled with something different, and stores it as
 something different, that is UCS2 or UCS4.

OK.  One thing is which is the default encoding for Unicode and another is how 
Python keeps Unicode internally.  And internally Python 3 is still using UCS2 
or UCS4, i.e. the same thing as in Python 2, so no worries here.

 I imagine dtype 'S' and 'U' need more clarification.  As it misses the
 concept of encodings it seems?  Currently, S appears to mean 8bit
 characters no encoding, and U appears to mean 16bit characters no
 encoding?  Or are some sort of default encodings assumed?
[clip]

You only need encoding if you are going to represent Unicode strings with 
other types (for example bytes).  Currently, NumPy can transparently 
import/export native Python Unicode strings (UCS2 or UCS4) into its own 
Unicode type (always UCS4).  So, we don't have to worry here either.
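
A quick check of that always-UCS4 storage (this also answers the itemsize 
question raised earlier in the thread):

    import numpy as np

    u = np.array(['asa'], dtype='U10')
    print(u.dtype.itemsize)   # 40 -- ten characters at four bytes each (UCS4)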

 btw, in my numpy tree there is a unicode_() alias to str in py3, and
 to unicode in py2 (inside the compat.py file).  This helped us in many
 cases with compatible string code in the pygame port.  This allows you
 to create unicode strings on both platforms with the same code.

Correct.  But, in addition, we are going to need a new 'bytes' dtype for NumPy 
for Python 3, right?

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread René Dudfield
On Fri, Nov 27, 2009 at 1:41 PM, Pauli Virtanen p...@iki.fi wrote:
 2to3/3to2 fixers will probably have to be written for users code
 here... whatever is decided.  At least warnings should be generated
 I'm guessing.

 Possibly. Does 2to3 support plugins? If yes, it could be possible to
 write one.

You can put them in here:
[lib_dir]lib2to3/fixes/fix_*.py

I'm not sure about how to use custom ones without just copying them
in... need to research that.

There's no documentation about how to write custom ones here:
http://docs.python.org/library/2to3.html

You can pass lib2to3 a package to try to import fixers from.  However I'm
not sure how to make that appear from the command line, other than
copying the fixer into place.  I guess the numpy setup script could
copy the fixer into place.
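
A custom fixer is just a small class; a minimal do-nothing sketch (module 
and class names made up) of the shape such a file takes:

    # fix_example.py -- skeleton of a lib2to3 fixer
    from lib2to3 import fixer_base

    class FixExample(fixer_base.BaseFix):
        # the pattern selects the grammar nodes handed to transform();
        # here: every string literal
        PATTERN = "STRING"

        def transform(self, node, results):
            # a real fixer would build and return a replacement node;
            # returning None leaves the source untouched
            return None

lib2to3.refactor.RefactoringTool accepts a list of fixer module names, so a 
fixer shipped inside a package can be applied without copying it into the 
standard library tree.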




 btw, in my numpy tree there is a unicode_() alias to str in py3, and
 to unicode in py2 (inside the compat.py file).  This helped us in many
 cases with compatible string code in the pygame port.  This allows you
 to create unicode strings on both platforms with the same code.

 Yes, I saw that. The name unicode_ is however already taken by the Numpy
 scalar type, so we need to think of a different name for it. asstring,
 maybe.

something like numpy.compat.unicode_ ?


 Btw, do you want to rebase your distutils changes on top of my tree? I
 tried yours out quickly, but there were some issues there that prevented
 distutils from working. (Also, you can use absolute imports both for
 Python 2 and 3 -- there's probably no need to use relative imports.)

        Pauli


hey,

yeah I definitely would :)   I don't have much time for the next week
or so though.


cu,


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread René Dudfield
On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote:
 Correct.  But, in addition, we are going to need a new 'bytes' dtype for NumPy
 for Python 3, right?

I think so.  However, I think S is probably closest to bytes... and
maybe S can be reused for bytes... I'm not sure though.

Also, what will a bytes dtype mean within a py2 program context?  Does
it matter if the bytes dtype just fails somehow if used in a py2
program?

cheers,


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread René Dudfield
On Fri, Nov 27, 2009 at 3:07 PM, René Dudfield ren...@gmail.com wrote:

 hey,

 yeah I definitely would :)   I don't have much time for the next week
 or so though.

btw, feel free to just copy whatever you like from there into your tree.

cheers,


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Francesc Alted
On Friday 27 November 2009 15:09:00 René Dudfield wrote:
 On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote:
  Correct.  But, in addition, we are going to need a new 'bytes' dtype for
  NumPy for Python 3, right?
 
 I think so.  However, I think S is probably closest to bytes... and
 maybe S can be reused for bytes... I'm not sure though.

That could be a good idea because that would ensure compatibility with 
existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it 
should).  The only thing that I don't like is that 'S' seems to be the 
initial letter for 'string', which is actually 'unicode' in Python 3 :-/
But, for the sake of compatibility, we can probably live with that.

 Also, what will a bytes dtype mean within a py2 program context?  Does
 it matter if the bytes dtype just fails somehow if used in a py2
 program?

Mmh, I'm of the opinion that the new 'bytes' type should be available only 
with NumPy for Python 3.  Would that be possible?

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
On Fri, 2009-11-27 at 16:33 +0100, Francesc Alted wrote:
 On Friday 27 November 2009 15:09:00 René Dudfield wrote:
  On Fri, Nov 27, 2009 at 1:49 PM, Francesc Alted fal...@pytables.org wrote:
   Correct.  But, in addition, we are going to need a new 'bytes' dtype for
   NumPy for Python 3, right?
  
  I think so.  However, I think S is probably closest to bytes... and
  maybe S can be reused for bytes... I'm not sure though.
 
 That could be a good idea because that would ensure compatibility with 
 existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it 
 should).  The only thing that I don't like is that that 'S' seems to be the 
 initial letter for 'string', which is actually 'unicode' in Python 3 :-/
 But, for the sake of compatibility, we can probably live with that.

Well, we can deprecate 'S' (ie. never show it in repr, always only 'B'
or 'U').

  Also, what will a bytes dtype mean within a py2 program context?  Does
  it matter if the bytes dtype just fails somehow if used in a py2
  program?
 
 Mmh, I'm of the opinion that the new 'bytes' type should be available only 
 with NumPy for Python 3.  Would that be possible?

I don't see a problem in making a bytes_ scalar type available for
Python2. In fact, it would be useful for making upgrading to Py3 easier.

-- 
Pauli Virtanen




Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Francesc Alted
On Friday 27 November 2009 16:41:04 Pauli Virtanen wrote:
   I think so.  However, I think S is probably closest to bytes... and
   maybe S can be reused for bytes... I'm not sure though.
 
  That could be a good idea because that would ensure compatibility with
  existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes',
  as it should).  The only thing that I don't like is that that 'S' seems
  to be the initial letter for 'string', which is actually 'unicode' in
  Python 3 :-/ But, for the sake of compatibility, we can probably live
  with that.
 
 Well, we can deprecate 'S' (ie. never show it in repr, always only 'B'
 or 'U').

Well, deprecating 'S' seems a sensible option too.  But why only avoid 
showing it in repr?  Why not issue a DeprecationWarning too?

   Also, what will a bytes dtype mean within a py2 program context?  Does
   it matter if the bytes dtype just fails somehow if used in a py2
   program?
 
  Mmh, I'm of the opinion that the new 'bytes' type should be available
  only with NumPy for Python 3.  Would that be possible?
 
 I don't see a problem in making a bytes_ scalar type available for
 Python2. In fact, it would be useful for making upgrading to Py3 easier.

I think introducing a bytes_ scalar dtype can be somewhat confusing for Python 
2 users.  But if the 'S' typecode is to be deprecated also for NumPy for 
Python 2, then it makes perfect sense to introduce bytes_ there too.

-- 
Francesc Alted


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Christopher Barker

 The point is that I don't think we can just decide to use Unicode or
 Bytes in all places where PyString was used earlier.

Agreed.

I think it's helpful to remember the origins of all this:


IMHO, there are two distinct types of data that Python2 strings support:

1) text: this is the traditional string.
2) bytes: raw bytes -- they could represent anything.

This, of course, is what the py3k string and bytes types are all about.

However, when python started, it just so happened that text was 
represented by an array of unsigned single byte integers, so there 
really was no point in having a bytes type, as a string would work 
just as well.

Enter unicode:

Now we have multiple ways of representing text internally, but want a 
single interface to that -- one that looks and acts like a sequence of 
characters to user's code. The result is that the unicode type was 
introduced.

In a way, unicode strings are a bit like arrays: they have an encoding 
associated with them (like a dtype in numpy). You can represent a given 
bit of text in multiple different arrangements of bytes, but they are all 
supposed to mean the same thing and, if you know the encoding, you can 
convert between them. This is kind of like how one can represent 5 in 
any of many dtypes: uint8, int16, int32, float32, float64, etc. Not every 
value represented by one dtype can be converted to all other dtypes, but 
many can. Just like encodings.
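
Concretely (a small illustration of the analogy, assuming a little-endian
machine and a NumPy recent enough to have tobytes()):

import numpy as np

# Same value 5, different byte layouts, chosen by dtype:
np.array(5, dtype=np.int16).tobytes()    # b'\x05\x00'
np.array(5, dtype=np.float32).tobytes()  # b'\x00\x00\xa0@'

# Same text, different byte layouts, chosen by encoding:
'five'.encode('utf-8')      # b'five'
'five'.encode('utf-16-le')  # b'f\x00i\x00v\x00e\x00'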

Anyway, all this brings me to think about the use of strings in numpy in 
this way: if it is meant to be a human-readable piece of text, it should 
be a unicode object. If not, then it is bytes.

So: fromstring and the like should, of course, work with bytes (though 
maybe buffers really...)

 Which one it will
 be should depend on the use. Users will expect that eg. array([1,2,3],
 dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
 dtype=b'f4').

Personally, I try to use np.float32 instead, anyway, but I digress. In 
this case, the type code is supposed to be a human-readable bit of 
text -- it should be a unicode object (convertible to ascii for 
interfacing with C...)

If we used b'f4', it would confuse things, as it couldn't be printed 
right. Also: would the actual bytes involved potentially change 
depending on what encoding was used for the literal? i.e. if the code 
was written in utf16, would that byte string be 4 bytes long?

 To summarize the use cases I've ran across so far:
 
 1) For 'S' dtype, I believe we use Bytes for the raw data and the
interface.

I don't think so here. 'S' is usually used to store human-readable 
strings; I'd certainly expect to be able to do:

s_array = np.array(['this', 'that'], dtype='S10')

And I'd expect it to work with non-literals that were unicode strings, 
i.e. human readable text. In fact, it's pretty rare that I'd ever want 
bytes here. So I'd see 'S' mapped to 'U' here.
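
For comparison, roughly what the two typecodes do on current NumPy under
Python 3:

import numpy as np

b = np.array(['this', 'that'], dtype='S10')   # stored as bytes, 1 byte/char
u = np.array(['this', 'that'], dtype='U10')   # stored as unicode, 4 bytes/char
b[0]                      # b'this'  (a bytes scalar)
u[0]                      # 'this'   (a unicode scalar)
b.itemsize, u.itemsize    # 10 and 40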

Francesc Alted wrote:
 the next  should still work:
 
 In [2]: s = np.array(['asa'], dtype="S10")
 
 In [3]: s[0]
 Out[3]: 'asa'  # will become b'asa' in Python 3

I don't like that -- I put in a string, and get a bytes object back?

 In [4]: s.dtype.itemsize
 Out[4]: 10 # still 1-byte per element

But what if the strings passed in aren't representable in one byte 
per character? Do we define 'S' as supporting ANSI-only strings? 
What encoding?
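
As it stands (current NumPy on Python 3, if I'm not mistaken), non-ASCII text
simply refuses to go into an 'S' array:

import numpy as np

np.array(['hola'], dtype='S10')       # fine: ASCII text round-trips
try:
    np.array(['häst'], dtype='S10')   # no defined byte form for 'ä' here
except UnicodeEncodeError as exc:
    print(exc)                        # 'ascii' codec can't encode ...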

Pauli Virtanen wrote:
 'U'
 is same as Python 3 unicode and probably in same internal representation
 (need to check). Neither is associated with encoding info.

Isn't it? I thought the encoding was always the same internally? so it 
is known?

Francesc Alted wrote:
 That could be a good idea because that would ensure compatibility with 
 existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes', as it 
 should).

What do you mean by compatible? It would mean a lot of user code would 
have to change with the 2-3 transition.

 The only thing that I don't like is that that 'S' seems to be the 
 initial letter for 'string', which is actually 'unicode' in Python 3 :-/
 But, for the sake of compatibility, we can probably live with that.

I suppose we could at least deprecate it.

 Also, what will a bytes dtype mean within a py2 program context?  Does
 it matter if the bytes dtype just fails somehow if used in a py2
 program?

well, it should work in 2.6 anyway.

Maybe we want to introduce a separate bytes dtype that's an alias
for 'S'?

What do we need bytes for? Does it support anything that np.uint8 
doesn't?


 2) The field names:
 
   a = array([], dtype=[('a', int)])
   a = array([], dtype=[(b'a', int)])
 
 This is somewhat of an internal issue. We need to decide whether we
 internally coerce input to Unicode or Bytes.

Unicode is clear to me here -- it really should match what Python does 
for variable names -- that is unicode in py3k, no?
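
A quick sketch of what that looks like in practice:

import numpy as np

dt = np.dtype([('a', np.int32), ('b', np.float64)])
arr = np.zeros(2, dtype=dt)
arr['a']     # field lookup with a plain (unicode) str, like an identifier
dt.names     # ('a', 'b')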

 3) Format strings
 
   a = array([], dtype=b'i4')
 
 I don't think it makes sense to handle format strings in Unicode
  internally -- they should always be coerced to bytes.

Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Pauli Virtanen
Fri, 2009-11-27 at 10:36 -0800, Christopher Barker wrote:
[clip]
  Which one it will
  be should depend on the use. Users will expect that eg. array([1,2,3],
  dtype='f4') still works, and they don't have to do e.g. array([1,2,3],
  dtype=b'f4').
 
 Personally, I try to use np.float32 instead, anyway, but I digress. In 
 this case, the type code is supposed to be a human-readable bit of 
 text -- it should be a unicode object (convertible to ascii for 
 interfacing with C...)

Yes, this would solve the repr() issue easily. Now that I look more
closely, the format strings are not actually used anywhere other than in
the descriptor user interface, so from an implementation POV Unicode is
not any harder.

[clip]
 Pauli Virtanen wrote:
  'U'
  is same as Python 3 unicode and probably in same internal representation
  (need to check). Neither is associated with encoding info.
 
 Isn't it? I thought the encoding was always the same internally? so it 
 is known?

Yes, so it need not be associated with a separate piece of encoding
info.

[clip]
 Maybe we want to introduce a separate bytes dtype that's an alias
 for 'S'?
 
 What do we need bytes for? does it support anything that np.uint8 
 doesn't?

It has a string representation, but that's probably it.

Actually, in Python 3, when you index a bytes object, you get integers
back, so just aliasing bytes_ = uint8 and making sure array() handles
bytes objects appropriately would be more or less consistent.
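
For example (Python 3 plus current NumPy):

import numpy as np

b'abc'[0]                              # 97 -- indexing bytes gives ints
np.frombuffer(b'abc', dtype=np.uint8)  # array([97, 98, 99], dtype=uint8)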

  2) The field names:
  
  a = array([], dtype=[('a', int)])
  a = array([], dtype=[(b'a', int)])
  
  This is somewhat of an internal issue. We need to decide whether we
  internally coerce input to Unicode or Bytes.
 
 Unicode is clear to me here -- it really should match what Python does 
 for variable names -- that is unicode in py3k, no?

Yep, let's follow Python. So Unicode and only Unicode it is.

***

Ok, thanks for the feedback. The right answers seem to be:

1) Unicode works as it is now, and Python3 strings are Unicode.

   Bytes objects are coerced to uint8 by array(). We don't do implicit
   conversions between Bytes and Unicode.

   The 'S' dtype character will be deprecated, never appear in repr(),
   and its usage will result in a warning.

2) Field names are unicode always.

   Some backward compatibility needs to be added in pickling, and
   maybe the npy file format needs a fixed encoding.

3) Dtype strings are a user interface detail, and will be Unicode.

-- 
Pauli Virtanen


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Anne Archibald
2009/11/27 Christopher Barker chris.bar...@noaa.gov:

 The point is that I don't think we can just decide to use Unicode or
 Bytes in all places where PyString was used earlier.

 Agreed.

I only half agree. It seems to me that for almost all situations where
PyString was used, the right data type is a python3 string (which is
unicode). I realize there may be some few cases where it is
appropriate to use bytes, but I think there needs to be a compelling
reason for each one.

 In a way, unicode strings are a bit like arrays: they have an encoding
 associated with them (like a dtype in numpy). You can represent a given
 bit of text in multiple different arrangements of bytes, but they are all
 supposed to mean the same thing and, if you know the encoding, you can
 convert between them. This is kind of like how one can represent 5 in
 any of many dtypes: uint8, int16, int32, float32, float64, etc. Not any
 value represented by one dtype can be converted to all other dtypes, but
 many can. Just like encodings.

This is incorrect. Unicode objects do not have default encodings or
multiple internal representations (within a single python interpreter,
at least). Unicode objects use a 2- or 4-byte representation
internally, but this is almost invisible to the user. Encodings only
become relevant when you want to convert a unicode object to a byte
stream. It is usually an error to store text in a byte stream (for it
to make sense you must provide some mechanism to specify the
encoding).
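
To make that concrete:

# Encodings only appear at the boundary where bytes are produced/consumed:
'café'.encode('utf-8')    # b'caf\xc3\xa9'
'café'.encode('latin-1')  # b'caf\xe9'

# Different byte sequences, same text once decoded:
b'caf\xc3\xa9'.decode('utf-8') == b'caf\xe9'.decode('latin-1')   # True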

 Anyway, all this brings me to think about the use of strings in numpy in
 this way: if it is meant to be a human-readable piece of text, it should
 be a unicode object. If not, then it is bytes.

 So: fromstring and the like should, of course, work with bytes (though
 maybe buffers really...)

I think if you're going to call it fromstring, it should convert from
strings (i.e. unicode strings). But really, I think it makes more
sense to rename it frombytes() and have it convert bytes objects. One
could then have

    def fromstring(s, encoding="utf-8"):
        return frombytes(s.encode(encoding))

as a shortcut. Maybe ASCII makes more sense as a default encoding. But
really, think about where the user's going to get the string: most of
the time it's coming from a disk file or a network stream, so it will
be a byte string already, so they should use frombytes.
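
A rough, runnable sketch of that idea, using np.frombuffer to play the role
of the proposed frombytes (which doesn't exist yet; the name and default
encoding here are hypothetical):

import numpy as np

def fromstring_via_bytes(s, dtype, encoding='latin-1'):
    # Hypothetical shortcut: encode the text, then parse the raw bytes.
    # latin-1 is used so every code point < 256 maps to exactly one byte.
    return np.frombuffer(s.encode(encoding), dtype=dtype)

fromstring_via_bytes('\x00\x00\x80?', np.float32)   # array([1.], dtype=float32)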

 To summarize the use cases I've ran across so far:

 1) For 'S' dtype, I believe we use Bytes for the raw data and the
    interface.

 I don't think so here. 'S' is usually used to store human-readable
 strings, I'd certainly expect to be able to do:

 s_array = np.array(['this', 'that'], dtype='S10')

 And I'd expect it to work with non-literals that were unicode strings,
 i.e. human readable text. In fact, it's pretty rare that I'd ever want
 bytes here. So I'd see 'S' mapped to 'U' here.

+1

 Francesc Alted wrote:
 the next  should still work:

 In [2]: s = np.array(['asa'], dtype="S10")

 In [3]: s[0]
 Out[3]: 'asa'  # will become b'asa' in Python 3

 I don't like that -- I put in a string, and get a bytes object back?

I agree.

 In [4]: s.dtype.itemsize
 Out[4]: 10     # still 1-byte per element

  But what if the strings passed in aren't representable in one byte
  per character? Do we define 'S' as supporting ANSI-only strings?
  What encoding?

Itemsize will change. That's fine.

 3) Format strings

       a = array([], dtype=b'i4')

 I don't think it makes sense to handle format strings in Unicode
 internally -- they should always be coerced to bytes.

 This should be fine -- we control what is a valid format string, and
 thus they can always be ASCII-safe.

I have to disagree. Why should we force the user to use bytes? The
format strings are just that, strings, and we should be able to supply
python strings to them. Keep in mind that coercing strings to bytes
requires extra information, namely the encoding. If you want to
emulate python2's value-dependent coercion - raise an exception only
if non-ASCII is present - keep in mind that python3 is specifically
removing that behaviour because of the problems it caused.
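
For reference, the Python 3 behaviour referred to above:

# Python 3 never mixes str and bytes implicitly:
try:
    'abc' + b'def'
except TypeError as exc:
    print(exc)   # e.g. can only concatenate str (not "bytes") to str

# Python 2, by contrast, coerced silently and only raised
# UnicodeDecodeError when a non-ASCII byte actually showed up.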


Anne
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Christopher Barker
Anne Archibald wrote:

 I don't think it makes sense to handle format strings in Unicode
 internally -- they should always be coerced to bytes.
 This should be fine -- we control what is a valid format string, and
 thus they can always be ASCII-safe.
 
 I have to disagree. Why should we force the user to use bytes?

One of us mis-understood that -- I THINK the idea was that internally 
numpy would use bytes (for easy conversion to/from char*), but they 
would get converted, so the user could pass in unicode strings (or 
bytes). I guess the question remains as to what you'd get when you 
printed a format string.

  Keep in mind that coercing strings to bytes
 requires extra information, namely the encoding.

but that is built-in to the unicode object.

I think the idea is that a format string is ALWAYS ASCII -- if there are 
any other characters in there, it's an invalid format anyway.

Unless I mis-understand what a format string is. I think it's a string 
you use to represent a custom dtype -- is that right?

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-27 Thread Dag Sverre Seljebotn
Francesc Alted wrote:
  On Friday 27 November 2009 16:41:04 Pauli Virtanen wrote:
 I think so.  However, I think S is probably closest to bytes... and
 maybe S can be reused for bytes... I'm not sure though.
 That could be a good idea because that would ensure compatibility with
 existing NumPy scripts (i.e. old 'string' dtypes are mapped to 'bytes',
  as it should).  The only thing that I don't like is that 'S' seems
 to be the initial letter for 'string', which is actually 'unicode' in
 Python 3 :-/ But, for the sake of compatibility, we can probably live
 with that.
 Well, we can deprecate 'S' (ie. never show it in repr, always only 'B'
 or 'U').
 
  Well, deprecating 'S' seems a sensible option too.  But why only avoid 
 showing it in repr?  Why not issue a DeprecationWarning too?

One thing to keep in mind here is that PEP 3118 actually defines a 
standard dtype format string, which is (mostly) incompatible with 
NumPy's. It should probably be supported as well when PEP 3118 is 
implemented.

Just something to keep in the back of one's mind when discussing this. 
For instance one could, instead of inventing something new, adopt the 
characters PEP 3118 uses (if there isn't a conflict):

  - b: Raw byte
  - c: ucs-1 encoding (latin 1, one byte)
  - u: ucs-2 encoding, two bytes
  - w: ucs-4 encoding, four bytes
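
For reference, PEP 3118 format strings extend the struct-module syntax, so
the base characters (not the NumPy-specific additions listed above) are
already available to experiment with:

import struct

struct.calcsize('<if')                            # 8: int32 followed by float32
struct.unpack('<if', struct.pack('<if', 3, 1.5))  # (3, 1.5)

memoryview(b'abc').format                         # 'B' -- unsigned bytes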

Long-term I hope the NumPy-specific format string will be deprecated, so 
that repr prints out the PEP 3118 format string, etc. But I'm aware that 
API breakage shouldn't happen when porting to Python 3.

-- 
Dag Sverre
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-26 Thread Pauli Virtanen
Hi,

The Python 3 porting needs some decisions on what is Bytes and
what is Unicode.

I'm currently taking the following approach. Comments?

***

dtype field names

        Either Bytes or Unicode.
        But 'a' and b'a' are *different* fields.

        The issue is that:
            Python 2: {'a': 2}[u'a'] == 2, {u'a': 2}['a'] == 2
            Python 3: {'a': 2}[b'a'], {b'a': 2}['a'] raise exceptions
        so the current assumptions in the C code of u'a' == b'a'
        cease to hold.

dtype titles

        If Bytes or Unicode, work similarly as field names.

dtype format strings, datetime tuple, and any other protocol strings

        Bytes. User can pass in Unicode, but it's converted using
        UTF8 codec.

        This will likely change repr() of various objects. Acceptable?
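
The field-name issue in concrete terms, on Python 3:

d = {'a': 2}
d['a']       # 2
d.get(b'a')  # None -- bytes and str keys never compare equal
try:
    d[b'a']
except KeyError:
    print("b'a' is a different key than 'a'")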


-- 
Pauli Virtanen



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-26 Thread Charles R Harris
Hi Pauli,

On Thu, Nov 26, 2009 at 4:08 PM, Pauli Virtanen p...@iki.fi wrote:

 Hi,

 The Python 3 porting needs some decisions on what is Bytes and
 what is Unicode.

 I'm currently taking the following approach. Comments?

***

 dtype field names

Either Bytes or Unicode.
But 'a' and b'a' are *different* fields.

The issue is that:
Python 2: {'a': 2}[u'a'] == 2, {u'a': 2}['a'] == 2
Python 3: {'a': 2}[b'a'], {b'a': 2}['a'] raise exceptions
so the current assumptions in the C code of u'a' == b'a'
cease to hold.

 dtype titles

If Bytes or Unicode, work similarly as field names.

 dtype format strings, datetime tuple, and any other protocol strings

Bytes. User can pass in Unicode, but it's converted using
UTF8 codec.

This will likely change repr() of various objects. Acceptable?


I'm not clear on your recommendation here: is it that we should use bytes,
with unicode converted to UTF8? Will that support arrays that have been
pickled and such? Or will we just have a minimum of code to fix up? And
could you expand on the changes that repr() might undergo?

Mind, I think using bytes sounds best, but I haven't looked into the whole
strings part of the transition and don't have an informed opinion on the
matter.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bytes vs. Unicode in Python3

2009-11-26 Thread René Dudfield
On Fri, Nov 27, 2009 at 1:37 AM, Charles R Harris
charlesr.har...@gmail.com wrote:
 Hi Pauli,

 On Thu, Nov 26, 2009 at 4:08 PM, Pauli Virtanen p...@iki.fi wrote:

 Hi,

 The Python 3 porting needs some decisions on what is Bytes and
 what is Unicode.

 I'm currently taking the following approach. Comments?

        ***

 dtype field names

        Either Bytes or Unicode.
        But 'a' and b'a' are *different* fields.

        The issue is that:
            Python 2: {'a': 2}[u'a'] == 2, {u'a': 2}['a'] == 2
            Python 3: {'a': 2}[b'a'], {b'a': 2}['a'] raise exceptions
        so the current assumptions in the C code of u'a' == b'a'
        cease to hold.

 dtype titles

        If Bytes or Unicode, work similarly as field names.

 dtype format strings, datetime tuple, and any other protocol strings

        Bytes. User can pass in Unicode, but it's converted using
        UTF8 codec.

        This will likely change repr() of various objects. Acceptable?


 I'm not clear on your recommendation here, is it that we should use bytes,
 with unicode converted to UTF8? Will that support arrays that have been
 pickled and such? Or will we just have a minimum of code to fix up? And
 could you expand on the changes that repr() might undergo?

 Mind, I think using bytes sounds best, but I haven't looked into the whole
 strings part of the transition and don't have an informed opinion on the
 matter.

 Chuck


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion



To help clarify for people who are not familiar with python 3...

To put it simply... in py3,
str is unicode text -- there is no separate 'unicode' type anymore.
bytes is raw data... but kind of like a str type with fewer methods.
bytes + an explicit encoding is how you handle non-UTF-8 text.
'array' exists in both py2 and py3 with a very similar interface on both.

There's a more precise description of strings in python3 on these pages:
http://diveintopython3.org/strings.html
http://diveintopython3.org/porting-code-to-python-3-with-2to3.html


How each of these should work depends on its use cases, imho.  Mostly,
if you are using the str type, then keep using the str type.

Many functions take both bytes and strings, since it is sane to work
on both bytes and strings from a user's perspective.  There have been
some methods in the stdlib that did not accept both; they have been
treated as bugs and are being fixed (e.g. some urllib methods).



For dtype, using the python 'str' by default seems ok, since all of
those characters come out in the same manner on both pythons for the
data used by numpy.

E.g. 'float32' is displayed the same as a py3 string and as a py2 string.
Internally it is unicode data, however.

Within py2, we save a pickle with the str:

>>> import pickle
>>> pickle.dump('float32', open('/tmp/p.pickle', 'wb'))


Within py3 we open the pickle with the str:

>>> import pickle
>>> pickle.load(open('/tmp/p.pickle', 'rb'))
'float32'
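
One wrinkle worth noting: if a Python 2 pickle contains 8-bit str data (as
pickled NumPy arrays typically do), Python 3's loader has to be told how to
treat it; pickle.load accepts an encoding argument for exactly this (sketch,
reusing the /tmp/p.pickle path from above):

import pickle

with open('/tmp/p.pickle', 'rb') as f:
    obj = pickle.load(f, encoding='latin1')   # or encoding='bytes'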



cheers,
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] .bytes

2007-10-15 Thread Yves Revaz
Nadav Horesh wrote:

array(1, dtype=float32).itemsize

  

ok, it will work fine for my purpose.
In numpy, is there any reason to suppress the attribute .bytes from the 
type object itself?
Is it simply because the native python types (int, float, complex, etc.) 
do not have this attribute?

  



___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
  



-- 
(o o)
oOO--(_)--OOo---
  Yves Revaz
  Lerma Batiment A   Tel : ++ 33 (0) 1 40 51 20 79
  Observatoire de Paris  Fax : ++ 33 (0) 1 40 51 20 02 
  77 av Denfert-Rochereaue-mail : [EMAIL PROTECTED]
  F-75014 Paris  Web : http://obswww.unige.ch/~revaz/
  FRANCE 


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] .bytes

2007-10-15 Thread Robert Kern
Yves Revaz wrote:
 Nadav Horesh wrote:
 
 array(1, dtype=float32).itemsize

 ok, it will work fine for my purpose.
 In numpy, is there any reason to supress the attribute .bytes from the 
 type object itself ?
 Is it simply because the native python types (int, float, complex, etc.) 
 do not have this attribute ?

The problem is that the instances of the scalar types do have the itemsize
attribute. The implementation of type objects is such that the type object will
also have that attribute, but it will be a stub:

In [15]: float64.itemsize
Out[15]: <attribute 'itemsize' of 'numpy.generic' objects>

A more straightforward way to get the itemsize is this:

In [17]: dtype(float64).itemsize
Out[17]: 8
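
A couple of equivalent spellings, for completeness:

import numpy as np

np.dtype(np.float64).itemsize   # 8
np.dtype('float32').itemsize    # 4
np.float32(1).itemsize          # 4 -- instances do carry the attribute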

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] .bytes

2007-10-14 Thread Yves Revaz
Dear list,

I'm translating code from numarray to numpy.
Unfortunately, I'm unable to find the equivalent of the command
that gives the number of bytes for a given type.
Using numarray I used:

>>> Float32.bytes
4

I'm sure there is a solution in numpy,
but I'm unable to find it.

Thanks,

Yves

-- 
(o o)
oOO--(_)--OOo---
  Yves Revaz
  Lerma Batiment A   Tel : ++ 33 (0) 1 40 51 20 79
  Observatoire de Paris  Fax : ++ 33 (0) 1 40 51 20 02 
  77 av Denfert-Rochereaue-mail : [EMAIL PROTECTED]
  F-75014 Paris  Web : http://www.lunix.ch/revaz/
  FRANCE 


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] .bytes

2007-10-14 Thread Matthieu Brucher
Hi,

In the description field, you have itemsize which is what you want.

Matthieu

2007/10/14, Yves Revaz [EMAIL PROTECTED]:

 Dear list,

 I'm translating codes from numarray to numpy.
 Unfortunately, I'm unable to find the equivalent of the command
 that give the number of bytes for a given type :
 using numarray I used :

  Float32.bytes
 4

 I'm sure there is a solution in numpy,
 but I'm unable to find it.

 Thanks,

 Yves

 --
 (o o)
 oOO--(_)--OOo---
   Yves Revaz
   Lerma Batiment A   Tel : ++ 33 (0) 1 40 51 20 79
   Observatoire de Paris  Fax : ++ 33 (0) 1 40 51 20 02
   77 av Denfert-Rochereaue-mail : [EMAIL PROTECTED]
   F-75014 Paris  Web : http://www.lunix.ch/revaz/
   FRANCE
 

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion