[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2012-01-05 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

Closing now.

--
nosy: +benjamin.peterson
resolution:  -> out of date
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-09-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

PEP 393 has been accepted and merged into Python 3.3. Python 3.3 doesn't 
need the Py_UNICODE_NEXT macro anymore. But my macros (unicode_macros.patch) 
are still useful.

--
versions: +Python 3.2 -Python 3.3




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-09-29 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Py_UNICODE_NEXT has been removed from 3.3 but it's still available and used in 
2.7/3.2 (even if it's private).  In order to fix #10521 on 2.7/3.2 the 
_Py_UNICODE_PUT_NEXT macro attached to this patch is required.

--
versions: +Python 3.3 -Python 3.2




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-22 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

The attached patch adds the following 4 public macros to unicodeobject.h:
  Py_UNICODE_IS_SURROGATE(ch)
  Py_UNICODE_IS_HIGH_SURROGATE(ch)
  Py_UNICODE_IS_LOW_SURROGATE(ch)
  Py_UNICODE_JOIN_SURROGATES(high, low)
and documents them.
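
For readers following along, here is a rough sketch of what four macros of this kind could look like. It is not copied from the attached patch: the DEMO_ names are hypothetical, and uint32_t stands in for whatever code point type the real header uses. The ranges are the standard UTF-16 surrogate ranges (D800-DBFF for high, DC00-DFFF for low).

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the macros from the patch.
   Note that each argument is evaluated more than once. */
#define DEMO_IS_SURROGATE(ch)       (0xD800 <= (ch) && (ch) <= 0xDFFF)
#define DEMO_IS_HIGH_SURROGATE(ch)  (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define DEMO_IS_LOW_SURROGATE(ch)   (0xDC00 <= (ch) && (ch) <= 0xDFFF)

/* Combine a high/low pair into a code point in U+10000..U+10FFFF:
   10 payload bits from each half, plus the 0x10000 offset. */
#define DEMO_JOIN_SURROGATES(high, low)          \
    (((((uint32_t)(high) & 0x03FF) << 10) |      \
      ((uint32_t)(low) & 0x03FF)) + 0x10000)
```

With these definitions, joining the pair D83D/DE00 gives U+1F600. Callers would be expected to check both halves with the predicates before joining.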

Since _Py_UNICODE_NEXT is still private, I'll commit it later as part of #9200.

--
Added file: http://bugs.python.org/file23000/issue10542b.diff




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
 
 Ezio Melotti ezio.melo...@gmail.com added the comment:
 
 The attached patch adds the following 4 public macros to unicodeobject.h:
   Py_UNICODE_IS_SURROGATE(ch)
   Py_UNICODE_IS_HIGH_SURROGATE(ch)
   Py_UNICODE_IS_LOW_SURROGATE(ch)
   Py_UNICODE_JOIN_SURROGATES(high, low)
 and documents them.
 
 Since _Py_UNICODE_NEXT is still private, I'll commit it later as part of 
 #9200.

Looks good.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-22 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 77171f993bf2 by Ezio Melotti in branch 'default':
#10542: Add 4 macros to work with surrogates: Py_UNICODE_IS_SURROGATE, 
Py_UNICODE_IS_HIGH_SURROGATE, Py_UNICODE_IS_LOW_SURROGATE, 
Py_UNICODE_JOIN_SURROGATES.
http://hg.python.org/cpython/rev/77171f993bf2

--
nosy: +python-dev




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-18 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I attached a patch to fix the str.is* methods on #9200 that also includes the 
macro.

Since they are not public there, I don't see a reason to do 2 separate commits 
on 2.7/3.2 (one for the feature and one for the fix).

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

On 17/08/2011 07:04, Ezio Melotti wrote:
 As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and 
 Py_UNICODE_JOIN_SURROGATES can be committed without leading _ in 3.3 and 
 with leading _ in 2.7/3.2.  They should go in unicodeobject.h

For Python 2.7 and 3.2, I would prefer to not touch a public header, and 
so add the macros in unicodeobject.c.

 and be public in 3.3+.

If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, 
they will have to subtract 0x10000 themselves (whereas my macros require 
the ordinal to be preprocessed).
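
As an illustration of the point about subtracting 0x10000, here is a minimal sketch of splitting macros that take the full code point and do the subtraction internally. The DEMO_ names are hypothetical and this is not the code from unicode_macros.patch. The argument is assumed to be in U+10000..U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical helpers: take the full code point (U+10000..U+10FFFF)
   and subtract 0x10000 internally, so callers need no preprocessing.
   The result is undefined for arguments outside that range. */
#define DEMO_HIGH_SURROGATE(cp) \
    ((uint16_t)(0xD800 | ((((uint32_t)(cp)) - 0x10000) >> 10)))
#define DEMO_LOW_SURROGATE(cp) \
    ((uint16_t)(0xDC00 | ((((uint32_t)(cp)) - 0x10000) & 0x3FF)))
```

The cost of this design is one extra subtraction per macro compared with a version that expects the caller to have subtracted 0x10000 already. The benefit is that the macros are harder to misuse.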

   * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have 
 agreed about the name they can go in.  They can be private in all the 3 
 branches and made public in 3.4 if they work well;

Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

   * IS_NONBMP doesn't simplify the code much but makes it more readable.  ICU 
 has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we 
 add this macro it might be ok to check for non-BMP;

If you want to make it public, it's better to call it PyUNICODE_IS_BMP() 
(check if the argument is in U+0000-U+FFFF).

   * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with 
 _Py_UNICODE_NEXT.  If they are they should get a better name because the 
 current one is not clear about what they do.

They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit 
wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in 
unicodeobject.c.

 Unless someone disagrees I'll prepare a patch with 
 PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for 
 unicodeobject.h, using them where necessary, with Victor's implementation, 
 and commit it (after a review).

Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not 
Py_UNICODE_JOIN_SURROGATES). I used the verb combine, taken from a 
comment in unicodeobject.c. combine is maybe better than join?

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
 
 STINNER Victor victor.stin...@haypocalc.com added the comment:
 
 On 17/08/2011 07:04, Ezio Melotti wrote:
 As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and 
 Py_UNICODE_JOIN_SURROGATES can be committed without leading _ in 3.3 and 
 with leading _ in 2.7/3.2.  They should go in unicodeobject.h

Ezio used two different naming schemes in his email. Please always
use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).

 For Python 2.7 and 3.2, I would prefer to not touch a public header, and 
 so add the macros in unicodeobject.c.

Why would you want to touch Python 2.7 at all ?

 and be public in 3.3+.
 
 If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, 
 they will have to subtract 0x10000 themselves (whereas my macros require 
 the ordinal to be preprocessed).

This can be done by having two definitions of the macros: one set for
UCS2 builds and one for UCS4.

   * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have 
 agreed about the name they can go in.  They can be private in all the 3 
 branches and made public in 3.4 if they work well;
 
 Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

Certainly not into Python 2.7. Adding macros in patch level releases is
also not such a good idea.

   * IS_NONBMP doesn't simplify the code much but makes it more readable.  
 ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so 
 if we add this macro it might be ok to check for non-BMP;
 
 If you want to make it public, it's better to call it PyUNICODE_IS_BMP() 
 (check if the argument is in U+0000-U+FFFF).

Py_UNICODE_IS_BMP() please.

   * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with 
 _Py_UNICODE_NEXT.  If they are they should get a better name because the 
 current one is not clear about what they do.
 
 They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit 
 wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in 
 unicodeobject.c.

 Unless someone disagrees I'll prepare a patch with 
 PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for 
 unicodeobject.h, using them where necessary, with Victor's 
 implementation, and commit it (after a review).
 
 Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not 
 Py_UNICODE_JOIN_SURROGATES). I used the verb combine, taken from a 
 comment in unicodeobject.c. combine is maybe better than join?

No, Py_UNICODE_... please !

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 For Python 2.7 and 3.2, I would prefer to not touch a public header, 
 and so add the macros in unicodeobject.c.

Is there some reason for this?  I think it's better if we have them in the same 
place rather than renaming and moving them in another file between 3.2 and 3.3.

 If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros 
 public, they will have to subtract 0x10000 themselves (whereas my 
 macros require the ordinal to be preprocessed).

If they turn out to be useful and we find a clearer name we can even make them 
public in 3.3, but we'll have to see about that.

 Note: I don't think that _Py_UNICODE*NEXT should go into
 Python 2.7 or 3.2.

If they don't it won't be possible to fix #9200 in those branches (unless we 
decide that the bug shouldn't be fixed there, but I would rather fix it).

 If you want to make it public, it's better to call it 
 PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

Yes, public APIs will follow the naming conventions.  Not sure if it's better 
to check if it's a BMP char, or if it's not.

 They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit 
 wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in 
 unicodeobject.c.

What is the naming convention for private macros in the same .c file where 
they are used?  Shouldn't they get at least a leading _?

 Unless someone disagrees I'll prepare a patch with 
 PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for 
 unicodeobject.h, using them where necessary, with Victor's implementation, 
 and commit it (after a review).

 Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not 
 Py_UNICODE_JOIN_SURROGATES).

All the other macros use PyUNICODE_*.

 I used the verb combine, taken from a  comment in unicodeobject.c. 
 combine is maybe better than join?

I like join, it's clear enough and shorter.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Ah yes, the correct prefix for functions working on Py_UNICODE 
characters/strings is Py_UNICODE, not PyUNICODE, sorry.

 For Python 2.7 and 3.2, I would prefer to not touch a public header,
 and so add the macros in unicodeobject.c.

 Is there some reason for this?

We don't add new features to stable releases.

 If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros
 public, they will have to subtract 0x10000 themselves (whereas my
 macros require the ordinal to be preprocessed).

 If they turn out to be useful and we find a clearer name we can even make 
 them public in 3.3, but we'll have to see about that.

I don't think that they are useful outside unicodeobject.c.

 Note: I don't think that _Py_UNICODE*NEXT should go into
 Python 2.7 or 3.2.

 If they don't it won't be possible to fix #9200 in those branches

I don't think that #9200 is a bug, but more a feature request.

 Not sure if it's better to check if it's a BMP char, or if it's not.

I prefer a shorter name and avoiding double negation: 
!Py_UNICODE_IS_NON_BMP(ch).
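
To make the naming argument concrete, here is a small sketch with a positively named BMP test. DEMO_IS_BMP and demo_utf16_len are hypothetical names and are not part of any patch on this issue. The non-BMP case is then written as a single negation, !DEMO_IS_BMP(ch).

```c
#include <stdint.h>

/* Hypothetical positive test: true for U+0000..U+FFFF. */
#define DEMO_IS_BMP(ch) ((uint32_t)(ch) <= 0xFFFF)

/* Number of UTF-16 code units needed for a code point,
   assuming ch is a valid code point (at most U+10FFFF). */
static inline int demo_utf16_len(uint32_t ch)
{
    return DEMO_IS_BMP(ch) ? 1 : 2;
}
```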

 What is the naming convention for private macros in the same .c file where 
 they are used?

Hopefully, there is no convention for private macros :-)

  Shouldn't they get at least a leading _?

Nope.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 Ezio used two different naming schemes in his email. Please always
 use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).

Indeed, that was a typo + copy/paste.  I meant to say Py_UNICODE_* and 
_Py_UNICODE_*.  Sorry about the confusion.

 Why would you want to touch Python 2.7 at all ?
 [...]
 Certainly not into Python 2.7. Adding macros in patch level releases
 is also not such a good idea.

Because it has the bug and we can fix it (the macros will be private so that we 
don't add any feature).
Also what about 3.2?  Are you saying that we should fix the bug in 3.2/3.3 only 
and leave 2.x alone or that you don't want the bug to be fixed in all the 
bug-fix releases (i.e. 2.7/3.2)?
My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them 
public in 3.3 so that new features are exposed only in 3.3.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
 
 Ezio Melotti ezio.melo...@gmail.com added the comment:
 
 Ezio used two different naming schemes in his email. Please always
 use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).
 
 Indeed, that was a typo + copy/paste.  I meant to say Py_UNICODE_* and 
 _Py_UNICODE_*.  Sorry about the confusion.

Good :-)

 Why would you want to touch Python 2.7 at all ?
 [...]
 Certainly not into Python 2.7. Adding macros in patch level releases
 is also not such a good idea.
 
 Because it has the bug and we can fix it (the macros will be private so that 
 we don't add any feature).
 Also what about 3.2?  Are you saying that we should fix the bug in 3.2/3.3 
 only and leave 2.x alone or that you don't want the bug to be fixed in all 
 the bug-fix releases (i.e. 2.7/3.2)?
 My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them 
 public in 3.3 so that new features are exposed only in 3.3.

For bug fixes, you can put the macros straight into unicodeobject.c,
but please leave unicodeobject.h untouched - otherwise people will
mess around with these macros (even if they are private) and users
will start to wonder about linker errors if they use old patch
level releases of Python 2.7/3.2.

Also note that some of these macros change the behavior of Python
- that's good if it fixes a bug (obviously :-)), but bad if it changes
areas that are correctly implemented and then suddenly expose
new behavior.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

 For bug fixes, you can put the macros straight into unicodeobject.c,
 but please leave unicodeobject.h untouched - otherwise people will
 mess around with these macros (even if they are private) and users
 will start to wonder about linker errors if they use old patch
 level releases of Python 2.7/3.2.

OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them 
in unicodeobject.c.

Regarding the name, other macros in unicodeobject.c don't have any prefix, so 
we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

 Also note that some of these macros change the behavior of Python
 - that's good if it fixes a bug (obviously :-)), but bad if it
 changes areas that are correctly implemented and then suddenly expose
 new behavior.

After this we can fix #9200 and make narrow builds behave correctly (i.e. like 
wide ones) with non-BMP chars (at least in some places).

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
 
 Ezio Melotti ezio.melo...@gmail.com added the comment:
 
 For bug fixes, you can put the macros straight into unicodeobject.c,
 but please leave unicodeobject.h untouched - otherwise people will
 mess around with these macros (even if they are private) and users
 will start to wonder about linker errors if they use old patch
 level releases of Python 2.7/3.2.
 
 OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them 
 in unicodeobject.c.
 
 Regarding the name, other macros in unicodeobject.c don't have any prefix, so 
 we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

Sure.

 Also note that some of these macros change the behavior of Python
 - that's good if it fixes a bug (obviously :-)), but bad if it
 changes areas that are correctly implemented and then suddenly expose
 new behavior.
 
 After this we can fix #9200 and make narrow builds behave correctly (i.e. 
 like wide ones) with non-BMP chars (at least in some places).

Ok.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Eric V. Smith

Eric V. Smith e...@trueblade.com added the comment:

On 8/17/2011 6:30 AM, Ezio Melotti wrote:
 OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them 
 in unicodeobject.c.

I believe the second file should be unicodeobject.h, correct?

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

Correct.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 Also what about 3.2?  Are you saying that we should fix the bug in
 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be
 fixed in all the bug-fix releases (i.e. 2.7/3.2)?

Notice that the macros themselves don't fix any bugs. As for the bugs
you apparently want to fix using these macros: they should be considered
on a case-by-case basis. Some of your planned bug fixes may introduce
incompatibilities that rule out fixing them.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-17 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 OK, so in 2.7/3.2 I'll put them in unicodeobject.c

It looks like #9200 only needs Py_UNICODE_NEXT, which can be implemented 
without the other Py_UNICODE_*SURROGATE* macros.
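
A sketch of how a NEXT-style reader can be written without the surrogate helper macros, using only bit masks. This is an assumption about the general technique, not the _Py_UNICODE_NEXT macro from the patch: demo_utf16_next is a hypothetical function over uint16_t code units, standing in for Py_UNICODE on a narrow build. An unpaired surrogate is returned unchanged as a single value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader: returns the code point starting at *pp and
   advances *pp past it. The caller must ensure *pp < end. */
static uint32_t demo_utf16_next(const uint16_t **pp, const uint16_t *end)
{
    const uint16_t *p = *pp;
    uint32_t ch = *p++;

    /* (x & 0xFC00) == 0xD800 tests for a high surrogate,
       (x & 0xFC00) == 0xDC00 for a low surrogate. */
    if ((ch & 0xFC00) == 0xD800 && p < end && (*p & 0xFC00) == 0xDC00) {
        ch = (((ch & 0x3FF) << 10) | (*p++ & 0x3FF)) + 0x10000;
    }
    *pp = p;
    return ch;
}
```

On a wide build the same interface would reduce to reading one unit and advancing by one.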

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
 
 A PEP 393 draft implementation is available at 
 https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 
 3.3, this issue will be outdated: there won't be narrow builds of Python 
 anymore (nor will there be wide builds).

Even if PEP 393 should go into Py4k one day (I don't believe that
such major changes can be done in a minor release), we will still have
to deal with surrogates in codecs, which is where these macros will get
used, so I don't see how PEP 393 relates to the idea of adding helper
macros to simplify the code.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I think the 4 macros:
 #define _Py_UNICODE_ISSURROGATE
 #define _Py_UNICODE_ISHIGHSURROGATE
 #define _Py_UNICODE_ISLOWSURROGATE
 #define _Py_UNICODE_JOIN_SURROGATES
are quite straightforward and can avoid using the leading _.

Since I would like to see #9200 fixed on 3.2 (and possibly 2.7 too), would it 
be ok to:
 1) commit the patch with the leading _ for all the macros on 3.2(/2.7);
 2) commit the patch with the leading _ only for the _NEXT macros in 3.3;
 3) fix #9200 on all these branches using the new macros (with or without _);
 4) remove the leading _ from the _NEXT macros in 3.4 if it turns out to work 
well?


 we will still have to deal with surrogates in codecs,
 which is where these macros will get used

They will also be used in many str methods and afaiu PEP 393 should address 
that.  I'm not sure it addresses codecs and builtin functions like chr() and 
ord() too.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 I think the 4 macros:
  #define _Py_UNICODE_ISSURROGATE
  #define _Py_UNICODE_ISHIGHSURROGATE
  #define _Py_UNICODE_ISLOWSURROGATE
  #define _Py_UNICODE_JOIN_SURROGATES
 are quite straightforward and can avoid using the leading _.

I don't want to bikeshed, but can we have proper consistent word
separation?
_Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE
(etc.)

  we will still have to deal with surrogates in codecs,
  which is where these macros will get used
 
 They will also be used in many str methods and afaiu PEP 393 should
 address that.  I'm not sure it addresses codecs and builtin functions
 like chr() and ord() too.

AFAIU, PEP 393 avoids producing surrogate pairs in the canonical
internal representation (that's one of its selling points). Only the
UTF-16 codecs would need to deal with surrogate pairs, in the encoded
form.
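
As a sketch of what "dealing with surrogate pairs in the encoded form" means for a UTF-16 encoder, the following hypothetical function writes one code point as one or two UTF-16 code units. The name demo_utf16_put and its return convention are assumptions for illustration, not CPython codec code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder step: writes cp to out (room for 2 units needed).
   Returns the number of units written, or 0 if cp cannot be encoded
   (a lone surrogate code point, or a value above U+10FFFF). */
static size_t demo_utf16_put(uint32_t cp, uint16_t *out)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not characters */
    if (cp > 0x10FFFF)
        return 0;                      /* outside the Unicode range */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;         /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits left to split */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```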

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER and 
Py_UNICODE_TOLOWER.  I agree that keeping the words separate makes them more 
readable though.
[0]: Include/unicodeobject.h:328

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:

I think the 4 macros:
 #define _Py_UNICODE_ISSURROGATE
 #define _Py_UNICODE_ISHIGHSURROGATE
 #define _Py_UNICODE_ISLOWSURROGATE
 #define _Py_UNICODE_JOIN_SURROGATES
are quite straightforward and can avoid using the leading _.

For what it's worth, I've seen Unicode documentation that prefers the
terms "lead surrogate" and "trail surrogate" as being clearer than the
terms "high surrogate" and "low surrogate".

For example, from the Unicode BOM FAQ at http://unicode.org/faq/utf_bom.html

Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values,
   reserved for use as the leading, and trailing values of paired code
   units in UTF-16. Leading, also called high, surrogates are from D800₁₆
   to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆.
   They are called surrogates, since they do not represent characters
   directly, but only as a pair.

BTW, considering recent discussions, you might want to read:

Q: Are there any 16-bit values that are invalid?

A: The two values FFFE₁₆ and FFFF₁₆ as well as the 32 values from FDD0₁₆
   to FDEF₁₆ represent noncharacters. They are invalid in interchange, but
   may be freely used internal to an implementation. Unpaired surrogates
   are invalid as well, i.e. any value in the range D800₁₆ to DBFF₁₆ not
   followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the
   range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to
   DBFF₁₆. [AF]
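
The pairing rule in that answer can be sketched as a simple scan over UTF-16 code units. This is only an illustration of the rule as quoted: demo_find_unpaired_surrogate is a hypothetical name, and the sketch does not check for noncharacters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker: returns the index of the first unpaired
   surrogate in s[0..n), or -1 if every surrogate is properly paired. */
static ptrdiff_t demo_find_unpaired_surrogate(const uint16_t *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                i += 2;
                continue;
            }
            return (ptrdiff_t)i;
        }
        if (u >= 0xDC00 && u <= 0xDFFF)
            return (ptrdiff_t)i;   /* low surrogate with no high before it */
        i++;
    }
    return -1;
}
```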

and also the answer to:

Q: Are there any paired surrogates that are invalid?

whose answer I here omit for brevity, as it is a table.

I suspect that you guys are now increasingly sold on the answer to the next FAQ 
right after that one. :)

Q: Because supplementary characters are uncommon, does that mean I can
   ignore them?

A: Just because supplementary characters (expressed with surrogate pairs in
   UTF-16) are uncommon does not mean that they should be neglected. They
   include:

* emoji symbols and emoticons, for interoperating with Japanese mobile
  phones
* uncommon (but not unused) CJK characters, important for personal and
  place names
* variation selectors for ideographic variation sequences
* important symbols for mathematics
* numerous minority scripts and historic scripts, important for some
  user communities

Another example of using lead and trail surrogates is in the first
sentence from http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html

* Naming: For clarity, High and Low surrogates are called Lead and Trail in
  the API, which gives a better sense of their ordering in a string.
  offset16 and offset32 are used to distinguish offsets to UTF-16
  boundaries vs offsets to UTF-32 boundaries. int char32 is used to contain
  UTF-32 characters, as opposed to char16, which is a UTF-16 code unit.
* Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a
  UTF-16 offset and back. Because of the difference in structure, you can
  roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if
  bounds(string, offset16) != TRAIL.
* Exceptions: The error checking will throw an exception if indices are out
  of bounds. Other than that, all methods will behave reasonably, even if
  unmatched surrogates or out-of-bounds UTF-32 values are present.
  UCharacter.isLegal() can be used to check for validity if desired.
* Unmatched Surrogates: If the string contains unmatched surrogates, then
  these are counted as one UTF-32 value. This matches their iteration
  behavior, which is vital. It also matches common display practice as
  missing glyphs (see the Unicode Standard Section 5.4, 5.5).
* Optimization: The method implementations may need optimization if the
  compiler doesn't fold static final methods. Since surrogate pairs will
  form an exceedingly small percentage of all the text in the world, the
  singleton case should always be optimized for.

You can also see this reflected in the utf.h file from the ICU project as part 
of their C API in ICU4C:

#define U_SENTINEL   (-1)
    This value is intended for sentinel values for APIs that (take or)
    return single code points (UChar32).
#define U_IS_UNICODE_NONCHAR(c)
    Is this code point a Unicode noncharacter?
#define U_IS_UNICODE_CHAR(c)
    Is c a Unicode code point value (0..U+10ffff) that can be assigned
    a character?
#define U_IS_BMP(c)   ((uint32_t)(c)<=0xffff)
    Is this code point a BMP code point (U+0000..U+ffff)?
#define U_IS_SUPPLEMENTARY(c)   ((uint32_t)((c)-0x10000)<=0xfffff)
    Is this code point a supplementary code point (U+10000..U+10ffff)?
#define U_IS_LEAD(c)   

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.  

I quote a few of these from http://unicode.org/faq/utf_bom.html below:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? 

A: A different issue arises if an unpaired surrogate is encountered when
   converting ill-formed UTF-16 data. By representing such an *unpaired*
   surrogate on its own as a 3-byte sequence, the resulting UTF-8 data
   stream would become ill-formed. While it faithfully reflects the nature
   of the input, Unicode conformance requires that encoding form conversion
   always results in a valid data stream. Therefore a converter *must*
   treat this as an error.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32? 

A: If an unpaired surrogate is encountered when converting ill-formed
   UTF-16 data, any conformant converter must treat this as an error. By
   representing such an unpaired surrogate on its own, the resulting UTF-32
   data stream would become ill-formed. While it faithfully reflects the
   nature of the input, Unicode conformance requires that encoding form
   conversion always results in a valid data stream.
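
A minimal sketch of what "treat this as an error" could look like for UTF-16 to UTF-32 conversion. The function name and the -1 error convention are assumptions made for illustration; a real converter would likely report the offset of the bad unit as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 code units to UTF-32 code points.
 * Returns the number of code points written, or -1 on an unpaired
 * surrogate. Name and error convention are illustrative only. */
static long utf16_to_utf32(const uint16_t *in, size_t n, uint32_t *out) {
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {        /* high (lead) surrogate */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return -1;                        /* no low surrogate follows */
            u = 0x10000 + ((u - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
            i++;                                  /* consume the low surrogate */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;                            /* stray low (trail) surrogate */
        }
        out[count++] = u;
    }
    return count;
}
```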

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
   yes, then can I still assume the remaining UTF-8 bytes are in big-endian
   order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
   endianness of the byte stream. UTF-8 always has the same byte order. An
   initial BOM is only used as a signature, an indication that an otherwise
   unmarked text file is in UTF-8. Note that some recipients of UTF-8
   encoded data do not expect a BOM. Where UTF-8 is used transparently in
   8-bit environments, the use of a BOM will interfere with any protocol or
   file format that expects specific ASCII characters at the beginning,
   such as the use of #! at the beginning of Unix shell scripts.
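
As a small illustration of the signature described above: U+FEFF encoded in UTF-8 is the three bytes EF BB BF, so a reader that wants to tolerate a BOM only has to test for that prefix. The helper name is made up for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Return the number of leading bytes taken by a UTF-8 signature: 0 or 3.
 * Hypothetical helper; the caller skips that many bytes before parsing. */
static size_t utf8_bom_length(const unsigned char *buf, size_t len) {
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF}; /* U+FEFF */
    return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
}
```

A script loader that did not skip these bytes would see EF BB BF where it expects "#!", which is the interference the FAQ warns about.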

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at
   the beginning of a text stream, U+FEFF should normally not occur. For
   backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING
   SPACE (ZWNBSP), and is then part of the content of the file or string.
   The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for
   expressing word joining semantics since it cannot be confused with a BOM.
   When designing a markup language or data protocol, the use of U+FEFF can
   be restricted to that of Byte Order Mark. In that case, any U+FEFF
   occurring in the middle of a file can be treated as an unsupported
   character.

Q: How do I tag data that does not interpret U+FEFF as a BOM?

A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to
   indicate little-endian UTF-16 text. If you do use a BOM, tag the text as
   simply UTF-16.

Q: Why wouldn’t I always use a protocol that requires a BOM?

A: Where the data has an associated type, such as a field in a database, a
   BOM is unnecessary. In particular, if a text data stream is marked as
   UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor
   permitted*. Any U+FEFF would be interpreted as a ZWNBSP.  Do not tag every
   string in a database or set of fields with a BOM, since it wastes space
   and complicates string concatenation. Moreover, it also means two data
   fields may have precisely the same content, but not be binary-equal
   (where one is prefaced by a BOM).
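
For data tagged as plain UTF-16, the byte order has to come from the BOM itself. A rough sketch, with invented names, of how a reader might pick the order from the first two bytes; data tagged UTF-16BE or UTF-16LE would bypass this and treat any U+FEFF as content.

```c
#include <stddef.h>

enum byte_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

/* Inspect the first two bytes of a stream tagged plain "UTF-16".
 * FE FF means big-endian, FF FE means little-endian. Hypothetical helper. */
static enum byte_order utf16_bom_order(const unsigned char *buf, size_t len) {
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) return ORDER_BIG;
        if (buf[0] == 0xFF && buf[1] == 0xFE) return ORDER_LITTLE;
    }
    return ORDER_UNKNOWN;   /* no BOM: the caller must fall back to a default */
}
```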

Somewhat frustratingly, I am now almost more confused than ever by the last two 
sentences here:

Q: What is a UTF?

A: A Unicode transformation format (UTF) is an algorithmic mapping from
   every Unicode code point (except surrogate code points) to a unique byte
   sequence. The ISO/IEC 10646 standard uses the term “UCS transformation
   format” for UTF; the two terms are merely synonyms for the same concept.

   Each UTF is reversible, thus every UTF supports *lossless round
   tripping*: mapping from any Unicode coded character sequence S to a
   sequence of bytes and back will produce S again. To ensure round
   tripping, a UTF mapping *must also* map all code points that are not
   valid Unicode characters to unique byte sequences. These invalid code
   points are the 66 *noncharacters* (including FFFE and FFFF), as well as
   unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at
the top are quite clear that it is illegal to have unpaired surrogates in a
UTF stream.  I don’t understand therefore what it is saying about “must also”
mapping all code points that aren’t valid Unicode characters to “unique byte
sequences” to ensure 

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 09:18:46 -0000:

>> I think the 4 macros:
>>  #define _Py_UNICODE_ISSURROGATE
>>  #define _Py_UNICODE_ISHIGHSURROGATE
>>  #define _Py_UNICODE_ISLOWSURROGATE
>>  #define _Py_UNICODE_JOIN_SURROGATES
>> are quite straightforward and can avoid using the leading _.

> I don't want to bikeshed, but can we have proper consistent word separation?
> _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE
> (etc.)

Oh good, I thought it was only me whohadtroublereadingthose. :)

--tom




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 09:23:50 -0000:

> All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER
> and Py_UNICODE_TOLOWER.  I agree that keeping the words separate makes them
> more readable though.
>
>   [0]: Include/unicodeobject.h:328

I am guessing that that is not quite why those don't have underscores
in them.  I bet it is actually something else.  Watch:

% unigrep '^\s*#\s*define\s+Py_[\p{Lu}_]+\b' unicodeobject.h
#define Py_UNICODEOBJECT_H
#define Py_USING_UNICODE
#define Py_UNICODE_WIDE
#define Py_UNICODE_ISSPACE(ch) \
#define Py_UNICODE_ISLOWER(ch) _PyUnicode_IsLowercase(ch)
#define Py_UNICODE_ISUPPER(ch) _PyUnicode_IsUppercase(ch)
#define Py_UNICODE_ISTITLE(ch) _PyUnicode_IsTitlecase(ch)
#define Py_UNICODE_ISLINEBREAK(ch) _PyUnicode_IsLinebreak(ch)
#define Py_UNICODE_TOLOWER(ch) _PyUnicode_ToLowercase(ch)
#define Py_UNICODE_TOUPPER(ch) _PyUnicode_ToUppercase(ch)
#define Py_UNICODE_TOTITLE(ch) _PyUnicode_ToTitlecase(ch)
#define Py_UNICODE_ISDECIMAL(ch) _PyUnicode_IsDecimalDigit(ch)
#define Py_UNICODE_ISDIGIT(ch) _PyUnicode_IsDigit(ch)
#define Py_UNICODE_ISNUMERIC(ch) _PyUnicode_IsNumeric(ch)
#define Py_UNICODE_ISPRINTABLE(ch) _PyUnicode_IsPrintable(ch)
#define Py_UNICODE_TODECIMAL(ch) _PyUnicode_ToDecimalDigit(ch)
#define Py_UNICODE_TODIGIT(ch) _PyUnicode_ToDigit(ch)
#define Py_UNICODE_TONUMERIC(ch) _PyUnicode_ToNumeric(ch)
#define Py_UNICODE_ISALPHA(ch) _PyUnicode_IsAlpha(ch)
#define Py_UNICODE_ISALNUM(ch) \
#define Py_UNICODE_COPY(target, source, length) \
#define Py_UNICODE_FILL(target, value, length) \
#define Py_UNICODE_MATCH(string, offset, substring) \
#define Py_UNICODE_REPLACEMENT_CHARACTER ((Py_UNICODE) 0xFFFD)

It looks like what is actually happening there is that you started out
with names of the normal ctype(3) macroish thingies:

 isalpha isupper islower isdigit isxdigit isalnum isspace ispunct
 isprint isgraph iscntrl isblank isascii toupper tolower toascii

and wanted to preserve those, which would lead to Py_UNICODE_TOLOWER and
Py_UNICODE_TOUPPER, since there are no functions in the original C versions
those seem to mirror.  Then when you wanted more of that ilk, you sensibly
kept to the same naming convention.

I eyeball few exceptions to that style here:

% perl -nle '/^\s*#\s*define\s+(Py_[\p{Lu}_]+)\b/ and print $1' Include/*.h | sort -dfu | fmt -150
Py_ABSTRACTOBJECT_H Py_ALIGNED Py_ALLOW_RECURSION Py_ARITHMETIC_RIGHT_SHIFT 
Py_ASDL_H Py_AST_H Py_ATOMIC_H Py_BEGIN_ALLOW_THREADS Py_BITSET_H
Py_BLOCK_THREADS Py_BLTINMODULE_H Py_BOOLOBJECT_H Py_BYTEARRAYOBJECT_H 
Py_BYTES_CTYPE_H Py_BYTESOBJECT_H Py_CAPSULE_H Py_CELLOBJECT_H Py_CEVAL_H
Py_CHARMASK Py_CLASSOBJECT_H Py_CLEANUP_SUPPORTED Py_CLEAR 
Py_CODECREGISTRY_H Py_CODE_H Py_COMPILE_H Py_COMPLEXOBJECT_H Py_CURSES_H 
Py_DECREF
Py_DEPRECATED Py_DESCROBJECT_H Py_DICTOBJECT_H Py_DTSF_ALT Py_DTSF_SIGN 
Py_DTST_FINITE Py_DTST_INFINITE Py_DTST_NAN Py_END_ALLOW_RECURSION
Py_END_ALLOW_THREADS Py_ENUMOBJECT_H Py_EQ Py_ERRCODE_H Py_ERRORS_H 
Py_EVAL_H Py_FILEOBJECT_H Py_FILEUTILS_H Py_FLOATOBJECT_H Py_FORCE_DOUBLE
Py_FORCE_EXPANSION Py_FORMAT_PARSETUPLE Py_FRAMEOBJECT_H Py_FUNCOBJECT_H 
Py_GCC_ATTRIBUTE Py_GE Py_GENOBJECT_H Py_GETENV Py_GRAMMAR_H Py_GT
Py_HUGE_VAL Py_IMPORT_H Py_INCREF Py_INTRCHECK_H Py_INVALID_SIZE Py_ISALNUM 
Py_ISALPHA Py_ISDIGIT Py_IS_FINITE Py_IS_INFINITY Py_ISLOWER Py_IS_NAN
Py_ISSPACE Py_ISUPPER Py_ISXDIGIT Py_ITEROBJECT_H Py_LE Py_LISTOBJECT_H 
Py_LL Py_LOCAL Py_LOCAL_INLINE Py_LONGINTREPR_H Py_LONGOBJECT_H Py_LT
Py_MARSHAL_H Py_MARSHAL_VERSION Py_MATH_E Py_MATH_PI Py_MEMCPY 
Py_MEMORYOBJECT_H Py_METAGRAMMAR_H Py_METHODOBJECT_H Py_MODSUPPORT_H 
Py_MODULEOBJECT_H
Py_NAN Py_NE Py_NODE_H Py_OBJECT_H Py_OBJIMPL_H Py_OPCODE_H Py_OSDEFS_H 
Py_OVERFLOWED Py_PARSETOK_H Py_PGEN_H Py_PGENHEADERS_H Py_PRINT_RAW
Py_PYARENA_H Py_PYDEBUG_H Py_PYFPE_H Py_PYGETOPT_H Py_PYMATH_H Py_PYMEM_H 
Py_PYPORT_H Py_PYSTATE_H Py_PYTHON_H Py_PYTHONRUN_H Py_PYTHREAD_H
Py_PYTIME_H Py_RANGEOBJECT_H Py_REFCNT Py_REF_DEBUG Py_RETURN_FALSE 
Py_RETURN_INF Py_RETURN_NAN Py_RETURN_NONE Py_RETURN_TRUE Py_SAFE_DOWNCAST
Py_SET_ERANGE_IF_OVERFLOW Py_SET_ERRNO_ON_MATH_ERROR Py_SETOBJECT_H Py_SIZE 
Py_SLICEOBJECT_H Py_STRCMP_H Py_STRTOD_H Py_STRUCTMEMBER_H Py_STRUCTSEQ_H
Py_SYMTABLE_H Py_SYSMODULE_H Py_TOKEN_H Py_TOLOWER Py_TOUPPER 
Py_TPFLAGS_BASE_EXC_SUBCLASS Py_TPFLAGS_BASETYPE Py_TPFLAGS_BYTES_SUBCLASS
Py_TPFLAGS_DEFAULT Py_TPFLAGS_DICT_SUBCLASS Py_TPFLAGS_HAVE_GC 
Py_TPFLAGS_HAVE_STACKLESS_EXTENSION Py_TPFLAGS_HAVE_VERSION_TAG 
Py_TPFLAGS_HEAPTYPE
Py_TPFLAGS_INT_SUBCLASS Py_TPFLAGS_IS_ABSTRACT Py_TPFLAGS_LIST_SUBCLASS 
Py_TPFLAGS_LONG_SUBCLASS Py_TPFLAGS_READY Py_TPFLAGS_READYING

[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Tom Christiansen wrote:
> So keeping your preamble bits, I might have considered doing it
> this way if it were me doing it:
>
> #define _Py_UNICODE_IS_SURROGATE
> #define _Py_UNICODE_IS_LEAD_SURROGATE
> #define _Py_UNICODE_IS_TRAIL_SURROGATE
> #define _Py_UNICODE_JOIN_SURROGATES
>
> But I also come from a culture that uses more underscores than you guys tend
> to, as shown in some of the macro names shown below from utf8.h file.  I find
> that most projects use more underscores in uppercase names than Python does.
> :)

The reasoning behind e.g. ISSURROGATE is that those names originate
from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE
macros, which in turn stem from the C APIs of the same names
(see unicodeobject.h for reference).

Regarding low/high vs. lead/trail: The Unicode database uses the
terms low/high and we do in Python as well, so let's stick with
those.

What I don't understand is why those macros should be declared
private to Python (with the leading underscore). They are quite
useful for extensions implementing codecs or other transformations
as well.

BTW: I think the other issues mentioned in the discussion are more
important to get right, than the names of those macros.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Tom Christiansen

Tom Christiansen tchr...@perl.com added the comment:

Marc-Andre Lemburg rep...@bugs.python.org wrote
   on Tue, 16 Aug 2011 12:11:22 -0000:

> The reasoning behind e.g. ISSURROGATE is that those names originate
> from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE
> macros, which in turn stem from the C APIs of the same names
> (see unicodeobject.h for reference).

I eventually figured that part out in the larger context.  
Makes sense looked at that way.

> Regarding low/high vs. lead/trail: The Unicode database uses the terms
> low/high and we do in Python as well, so let's stick with those.

Yes, those are their block assignments, Block=High_Surrogates and
Block=Low_Surrogates.  I just thought I should mention that in the time
since those were invented (which cannot be changed), after using them in
real code for some years, their lingo seems to have evolved away from those
initial names and toward lead/trail as less confusing.

> What I don't understand is why those macros should be declared
> private to Python (with the leading underscore). They are quite
> useful for extensions implementing codecs or other transformations
> as well.

I was wondering about that myself.  Beyond there being a lot fewer of those
private macros in the Python *.h files, they also seem to be of rather different
character than the iswhatever() macros:

$ perl -nle '/^\s*#\s*define\s+(_Py_[\p{Lu}_]+)\b/ and print $1' *.h | sort -dfu | fmt -160
_Py_ANNOTATE_BARRIER_DESTROY _Py_ANNOTATE_BARRIER_INIT 
_Py_ANNOTATE_BARRIER_WAIT_AFTER _Py_ANNOTATE_BARRIER_WAIT_BEFORE 
_Py_ANNOTATE_BENIGN_RACE
_Py_ANNOTATE_BENIGN_RACE_SIZED _Py_ANNOTATE_BENIGN_RACE_STATIC 
_Py_ANNOTATE_CONDVAR_LOCK_WAIT _Py_ANNOTATE_CONDVAR_SIGNAL 
_Py_ANNOTATE_CONDVAR_SIGNAL_ALL
_Py_ANNOTATE_CONDVAR_WAIT _Py_ANNOTATE_ENABLE_RACE_DETECTION 
_Py_ANNOTATE_EXPECT_RACE _Py_ANNOTATE_FLUSH_STATE _Py_ANNOTATE_HAPPENS_AFTER
_Py_ANNOTATE_HAPPENS_BEFORE _Py_ANNOTATE_IGNORE_READS_AND_WRITES_BEGIN 
_Py_ANNOTATE_IGNORE_READS_AND_WRITES_END _Py_ANNOTATE_IGNORE_READS_BEGIN
_Py_ANNOTATE_IGNORE_READS_END _Py_ANNOTATE_IGNORE_SYNC_BEGIN 
_Py_ANNOTATE_IGNORE_SYNC_END _Py_ANNOTATE_IGNORE_WRITES_BEGIN 
_Py_ANNOTATE_IGNORE_WRITES_END
_Py_ANNOTATE_MUTEX_IS_USED_AS_CONDVAR _Py_ANNOTATE_NEW_MEMORY 
_Py_ANNOTATE_NO_OP _Py_ANNOTATE_PCQ_CREATE _Py_ANNOTATE_PCQ_DESTROY 
_Py_ANNOTATE_PCQ_GET
_Py_ANNOTATE_PCQ_PUT _Py_ANNOTATE_PUBLISH_MEMORY_RANGE 
_Py_ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX _Py_ANNOTATE_RWLOCK_ACQUIRED 
_Py_ANNOTATE_RWLOCK_CREATE
_Py_ANNOTATE_RWLOCK_DESTROY _Py_ANNOTATE_RWLOCK_RELEASED 
_Py_ANNOTATE_SWAP_MEMORY_RANGE _Py_ANNOTATE_THREAD_NAME 
_Py_ANNOTATE_TRACE_MEMORY
_Py_ANNOTATE_UNPROTECTED_READ _Py_ANNOTATE_UNPUBLISH_MEMORY_RANGE _Py_AS_GC 
_Py_CHECK_REFCNT _Py_COUNT_ALLOCS_COMMA _Py_DEC_REFTOTAL _Py_DEC_TPFREES
_Py_INC_REFTOTAL _Py_INC_TPALLOCS _Py_INC_TPFREES _Py_PARSE_PID 
_Py_REF_DEBUG_COMMA _Py_SET_EDOM_FOR_NAN

> BTW: I think the other issues mentioned in the discussion are more
> important to get right, than the names of those macros.

Yup.  Just paint it red. :)

--tom




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I'm reposting my patch from #12751. I think that it's simpler than belopolsky's
patch: it doesn't add public macros in unicodeobject.h and doesn't add the
complex Py_UNICODE_NEXT() macro. My patch only adds private macros in
unicodeobject.c to factor out the duplicated code.

I don't want to add public macros because with the stable API and with the PEP 
393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. 
In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

Copy/paste of the initial message of my issue #12751 (msg142108):
---
A lot of code is duplicated in unicodeobject.c to manipulate (encode/decode) 
surrogates. Each function has from one to three different implementations. The 
new decode_ucs4() function adds a new implementation. Attached patch replaces 
this code by macros.

I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE
are important for speed. ((ch & 0xFFFFFC00UL) == 0xD800) (from decode_ucs4) is
*a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520
@ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

These 3 macros have to be checked, I wrote the first one:

#define IS_SURROGATE(ch) (((ch) & 0xFFFFF800UL) == 0xD800)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800)
#define IS_LOW_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xDC00)

I added a cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if
Py_UNICODE is 16 bits (narrow build). It's maybe useless.

#define COMBINE_SURROGATES(ch1, ch2) \
 (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been
preprocessed to fit in [0; 0xFFFFF]. I added this requirement in the comment of
these macros. It would be better to have only one macro to do the two
operations, but because *p++ (dereference and increment) is usually used, I
prefer to avoid a single macro (I don't like passing *p++ to a macro that uses
its argument more than once).

Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.
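
The concern about passing *p++ to a macro can be shown with a small sketch. The names are hypothetical and the range-style macro is used only because it names its argument twice; an inline function is shown for contrast.

```c
#include <stdint.h>

/* A macro that names its argument twice may evaluate it twice. */
#define IS_SURROGATE_RANGE(ch) (0xD800 <= (ch) && (ch) <= 0xDFFF)

/* A function evaluates its argument exactly once. */
static inline int is_surrogate(uint32_t ch) {
    return 0xD800 <= ch && ch <= 0xDFFF;
}

/* How many units does the pointer move when *p++ is the argument? */
static int advance_with_macro(const uint16_t *p) {
    const uint16_t *start = p;
    (void)IS_SURROGATE_RANGE(*p++);   /* expands to two *p++ expressions */
    return (int)(p - start);
}

static int advance_with_function(const uint16_t *p) {
    const uint16_t *start = p;
    (void)is_surrogate(*p++);
    return (int)(p - start);
}
```

With the macro, the pointer advances by two when the first comparison succeeds and by one when it short-circuits, so the bug is data dependent. A mask-style macro that mentions its argument once does not have this problem.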

I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid a useless test on
ch2 on narrow build.

I also added an IS_NONBMP macro just because I prefer macros over hardcoded
constants.
---
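
A self-contained sketch for checking the macros described in this message. It restates them locally with uint32_t standing in for Py_UCS4 and with the masks written at full 32-bit width, both assumptions of this sketch; nothing here is copied from a CPython release. The helper mask_matches_range() is invented for the check.

```c
#include <stdint.h>

/* Local re-statements of the macros under discussion. */
#define IS_SURROGATE(ch)      (((ch) & 0xFFFFF800UL) == 0xD800UL)
#define IS_HIGH_SURROGATE(ch) (((ch) & 0xFFFFFC00UL) == 0xD800UL)
#define IS_LOW_SURROGATE(ch)  (((ch) & 0xFFFFFC00UL) == 0xDC00UL)
#define COMBINE_SURROGATES(ch1, ch2) \
    (((((uint32_t)(ch1) & 0x3FF) << 10) | ((uint32_t)(ch2) & 0x3FF)) + 0x10000)

/* Verify that the mask form and the range form of IS_HIGH_SURROGATE
 * agree for every value up to U+10FFFF. Returns 1 if they always agree. */
static int mask_matches_range(void) {
    for (uint32_t ch = 0; ch <= 0x10FFFF; ch++) {
        int by_range = (0xD800 <= ch && ch <= 0xDBFF);
        int by_mask = IS_HIGH_SURROGATE(ch) ? 1 : 0;
        if (by_mask != by_range)
            return 0;
    }
    return 1;
}
```

Note that a 16-bit mask such as 0xFC00 would only be equivalent when the value is known to fit in 16 bits; on a 32-bit value it would also match, for example, 0x1D800.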

--
Added file: http://bugs.python.org/file22915/unicode_macros.patch




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
 
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
> I'm reposting my patch from #12751. I think that it's simpler than
> belopolsky's patch: it doesn't add public macros in unicodeobject.h and
> doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private
> macros in unicodeobject.c to factor out the duplicated code.
>
> I don't want to add public macros because with the stable API and with the
> PEP 393, we are trying to hide the Py_UNICODE type and PyUnicodeObject
> internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside
> unicodeobject.c.

PEP 393 is an optional feature for extension writers. If they don't
need PEP 393 style stable ABIs and want to use the macros, they
should be able to. I'm therefore -1 on making them private.

Regarding separating adding the various surrogate macros and
the next-macros: I don't see a problem with adding both in
Python 3.3.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Marc-Andre Lemburg wrote:
 
> Marc-Andre Lemburg m...@egenix.com added the comment:
>
> STINNER Victor wrote:
>>
>> STINNER Victor victor.stin...@haypocalc.com added the comment:
>>
>> I'm reposting my patch from #12751. I think that it's simpler than
>> belopolsky's patch: it doesn't add public macros in unicodeobject.h and
>> doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private
>> macros in unicodeobject.c to factor out the duplicated code.
>>
>> I don't want to add public macros because with the stable API and with the
>> PEP 393, we are trying to hide the Py_UNICODE type and PyUnicodeObject
>> internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside
>> unicodeobject.c.
>
> PEP 393 is an optional feature for extension writers. If they don't
> need PEP 393 style stable ABIs and want to use the macros, they
> should be able to. I'm therefore -1 on making them private.

Sorry, I mean PEP 384, not PEP 393. Whether PEP 393 will turn out
to be a workable solution has yet to be seen, but that's a
different subject. In any case, Py_UNICODE and access macros
for PyUnicodeObject are in wide-spread use, so trying to hide
them won't work until we reach Py4k.

> Regarding separating adding the various surrogate macros and
> the next-macros: I don't see a problem with adding both in
> Python 3.3.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

My patch version 2: don't test for a specific major version of an OS, test only 
its name. My patch now changes also tests for FreeBSD, NetBSD, OpenBSD, (...), 
and the _expectations list in regrtest.py.

--
Added file: http://bugs.python.org/file22916/linux3-v2.patch




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


Removed file: http://bugs.python.org/file22916/linux3-v2.patch




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
Removed message: http://bugs.python.org/msg142225




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

(oops, msg142225 was for issue #12326)




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

The code review links point to something weird.  Victor, can you upload your 
patch for review?

My first impression is that your patch does not accomplish much beyond 
replacing some literal expressions with macros.  What I wanted to achieve with 
this issue was to enable writing code without #ifdef Py_UNICODE_WIDE branches.  
In your patch these branches seem to still be there and in fact it appears that 
new code is longer than the old one (I am not sure why, but I see more '+' than 
'-'s in your patch.)
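
A rough sketch of the kind of helper that lets callers iterate by code point without an #ifdef on the build width. It works on a buffer of 16-bit units; read_next() is a hypothetical name, not the Py_UNICODE_NEXT macro from the patch, and returning a lone surrogate unchanged is an assumption of this sketch.

```c
#include <stdint.h>

/* Read one code point from [*pp, end) and advance *pp past it.
 * A well-formed surrogate pair is joined into one code point; a lone
 * surrogate is returned unchanged. Requires *pp < end. */
static uint32_t read_next(const uint16_t **pp, const uint16_t *end) {
    const uint16_t *p = *pp;
    uint32_t ch = *p++;
    if ((ch & 0xFC00) == 0xD800 && p < end && (*p & 0xFC00) == 0xDC00) {
        ch = (((ch & 0x3FF) << 10) | (*p & 0x3FFu)) + 0x10000;
        p++;                          /* consume the low surrogate too */
    }
    *pp = p;
    return ch;
}
```

A caller loops with `while (p < end) ch = read_next(&p, end);` and sees the same sequence of code points a 32-bit buffer would give, apart from unpaired surrogates.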




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> The code review links point to something weird.

That's because I posted a patch for another issue. It's the patch set 5, not 
the patch set 6 :-)

Direct link:
http://bugs.python.org/review/10542/patch/3174/9874

> My first impression is that your patch does not accomplish much beyond
> replacing some literal expressions with macros.

Yes, and it avoids the duplication of some code patterns, as explained in my 
message. I would like to avoid constants in the code. Some macros are *a little 
bit* faster than the current code.

> What I wanted to achieve with this issue was to enable writing code
> without #ifdef Py_UNICODE_WIDE branches.

Yes, and I think that it's better to split this issue into two steps:

 1- add macros for the surrogates (test, join, ...)
 2- Py_UNICODE_NEXT()

> In your patch these branches seem to still be there
> and in fact it appears that new code is longer than the old one

Yes, the code adds more lines than it removes. Is it a problem? My goal is to 
have more readable code (easier to maintain).




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-16 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and
Py_UNICODE_JOIN_SURROGATES macros can be committed without a leading _ in 3.3
and with a leading _ in 2.7/3.2.  They should go in unicodeobject.h and be
public in 3.3+.

Regarding the name, it would be fine with me to use
Py_UNICODE_IS_HIGH_SURROGATE.  Other IS* macros don't separate the words with
underscores, but JOIN_SURROGATES and other proposed macros (e.g.
PUT_NEXT/WRITE_NEXT) do.  Also these macros are not related to any existing
API like e.g. isalpha.  I think HIGH/LOW are fine, we can mention lead/trail
in the doc.

Regarding the implementation, we could use Victor's one if it's faster and it 
has no other side effects.

Regarding the other macros:
 * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed 
about the name they can go in.  They can be private in all the 3 branches and 
made public in 3.4 if they work well;
 * IS_NONBMP doesn't simplify the code much but makes it more readable.  ICU
has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we
add this macro it might be ok to check for non-BMP;
 * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT.  
If they are they should get a better name because the current one is not clear 
about what they do.
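
The write side can be sketched the same way as the read side. put_next() below is a hypothetical counterpart to the PUT_NEXT/WRITE_NEXT macro being discussed, written as a function over 16-bit units; the split into a high and a low surrogate is what the HIGH_SURROGATE and LOW_SURROGATE helpers would compute.

```c
#include <stdint.h>

/* Write ch at *pp as one or two UTF-16 code units and advance *pp.
 * Assumes ch <= 0x10FFFF and that the buffer has room for two units. */
static void put_next(uint16_t **pp, uint32_t ch) {
    uint16_t *p = *pp;
    if (ch > 0xFFFF) {                              /* the non-BMP case */
        ch -= 0x10000;                              /* now fits in 20 bits */
        *p++ = (uint16_t)(0xD800 | (ch >> 10));     /* high surrogate */
        *p++ = (uint16_t)(0xDC00 | (ch & 0x3FF));   /* low surrogate */
    } else {
        *p++ = (uint16_t)ch;
    }
    *pp = p;
}
```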


Unless someone disagrees I'll prepare a patch with
Py_UNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for
unicodeobject.h, using them where necessary and using Victor's
implementation, and commit it (after a review).

We can think about the rest later.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

See also #12751.

--
nosy: +tchrist




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

A PEP 393 draft implementation is available at 
https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, 
this issue will be outdated: there won't be narrow builds of Python anymore 
(nor will there be wide builds).




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2011-08-15 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

That's really good news.
Some Unicode issues can still be fixed on 2.7 and 3.2 though.
FWIW I was planning to look at this and #9200 in the following days and see if 
I can fix them.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-30 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> Actually, it looks like PEP 3131 and the Language Reference [1] still
> disagree.  The latter says:
>
> identifier  ::=  id_start id_continue*
>
> which should probably be
>
> identifier  ::=  xid_start xid_continue*
>
> instead.

Interesting. XID_* is being used in the PEP since r57023, whereas the
documentation was added in r57824. In any case, this is now fixed in
r87575/r87576.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-30 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

> I think the proposal is that fixing this minefield can wait until
> Python 3.3 (or even 3.4, or later).

That is what I was thinking.  (Alex: You might not know that Martin
was the main proponent of non-ASCII identifiers, so this assessment
should have some weight.)

> I'm thinking about an approach of a variable representation:
> one, two, or four bytes, depending on the widest character that
> appears in the string. I think it can be arranged to make this mostly
> backwards-compatible with existing APIs, so it doesn't need to wait
> for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.

That is an interesting idea.  I would be interested in helping out
when you'll implement it.
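
The variable-representation idea quoted above can be illustrated with a short sketch: scan the code points once and pick the narrowest unit that holds the widest one. The function name and the exact thresholds (U+00FF and U+FFFF) are assumptions made here for illustration, not details taken from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes per character needed to store every code point in s without
 * using surrogates: 1 up to U+00FF, 2 up to U+FFFF, otherwise 4. */
static int storage_width(const uint32_t *s, size_t n) {
    uint32_t widest = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] > widest)
            widest = s[i];
    if (widest <= 0xFF) return 1;
    if (widest <= 0xFFFF) return 2;
    return 4;
}
```

Because the width is chosen per string, every string is indexable by code point and surrogate pairs never need to appear in memory.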




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> I am attaching a patch for commit review.  I added an underscore prefix to
> all new macros.  This way I am not introducing new features and we will have
> a full release cycle to come up with better names.  I would just note that
> next terminology is consistent with PyDict_Next and _PySet_NextEntry.  The
> latter suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

I don't think this should go into 3.2. The macros have the potential
of subtly changing Python semantics when used in places that previously
did not support auto-joining surrogates. Let's wait for 3.3 with the
change.

Some comments:

* The macros still need some more attention to enhance their performance.

* For consistency, I'd choose names Py_UNICODE_READ_NEXT()
  and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and
  Py_UNICODE_PUT_NEXT().

* Py_UNICODE_JOIN_SURROGATES() either needs to go away completely
  (and be integrated straight into the other macros), or be renamed
  to Py_UCS4_JOIN_SURROGATES(), since it doesn't return Py_UNICODE
  values.

* The macros need to be carefully documented, both in unicodeobject.h
  and the general docs.

* Your _Py_UNICODE_PUT_NEXT() implementation is missing a few casts
  to turn ch into a Py_UNICODE/Py_UCS4 value.

* Same for your _Py_UNICODE_NEXT() to make sure that the return
  value is indeed a Py_UNICODE value.

* In general, we should probably be clear on the allowed input
  and define the output types in the documentation.




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

> Let's wait for 3.3 with the change.

Definitely.

--
nosy: +georg.brandl
versions: +Python 3.3 -Python 3.2




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 10:00 AM, Georg Brandl rep...@bugs.python.org wrote:
..

>> Let's wait for 3.3 with the change.
>
> Definitely.

Does this also mean that the numerous surrogates related bugs should
wait until 3.3 as well? (See issues #9200 and #10521.)

This patch was just a stepping stone for the bug fixes.   I
deliberately kept the code changes to the minimum sufficient to
demonstrate and test the new macros.  I would not mind restricting the
patch further by limiting it to the header file changes so that the
macros can be used to fix bugs.  Fixing the bugs in the old verbose
style does not seem feasible.

Note that surrogate bugs are not as exotic as they seem.  For example,
on a wide build I can do

42

but on a narrow build,

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
퐀 = 42
   ^
SyntaxError: invalid character in identifier

So at the moment, narrow and wide builds implement two different languages.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Georg Brandl

Georg Brandl ge...@python.org added the comment:

That bug already strikes me as quite exotic.

You need to at least address Marc-Andre's remarks, and to give an overview of 
what else you'd like to change as well, and how this could affect semantics.

Remember that the next release is already a release candidate.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> * The macros still need some more attention to enhance their performance.

Although I made your suggested change from '-' to '&', I seriously
doubt that this would make any difference on modern CPUs.  Why do you
think these macros are performance critical?  Users with lots of
supplementary characters in their files are probably better off with a
wide build where Py_UNICODE_NEXT() is just *ptr++ and can hardly be
further optimized.  Higher performance algorithms are possible, but
those should probably do some loop unrolling and/or process more than
one character at a time.  At this point, however it is too soon to
worry about optimization before we even know where these macros will
be used.
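
A minimal sketch of the two joining strategies mentioned above, subtraction versus bit masking. It uses uint16_t and uint32_t as stand-ins for a narrow-build Py_UNICODE and for Py_UCS4, and the function names are made up for illustration; they are not the macros from the attached patch. Both forms compute the same value for a valid pair, so any speed difference would have to be measured rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not the macros from the patch. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Join a pair by subtracting the surrogate base values. */
static uint32_t join_by_subtraction(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return (((uint32_t)high - 0xD800) << 10)
           + ((uint32_t)low - 0xDC00) + 0x10000;
}

/* Join a pair by masking off the low 10 bits of each code unit. */
static uint32_t join_by_mask(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return ((((uint32_t)high & 0x03FF) << 10)
            | ((uint32_t)low & 0x03FF)) + 0x10000;
}
```

For example, the pair 0xD835 0xDC00 joins to U+1D400 with either function.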

> * For consistency, I'd choose names Py_UNICODE_READ_NEXT()
>   and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and
>   Py_UNICODE_PUT_NEXT().


I would leave it for you and Raymond to reach a consensus.  My
understanding is that Raymond does not want "next" in the name, so
your suggestion still conflicts with that.  I would mildly prefer
GET/PUT over READ/WRITE because the latter suggests multiple
characters.

As discussed before, the macro prefix does not imply the return value.
 Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any
other Py_UNICODE_ macro.   Note that I added a leading underscore to
Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no
immediate pressure to get the names perfect.

> * The macros need to be carefully documented, both in unicodeobject.h
>   and the general docs.


I've added a description above _Py_UNICODE_*NEXT macros.  I would
really like to see these macros in private use for a while before they
are published for general audience.  I'll add a comment describing
_Py_UNICODE_JOIN_SURROGATES.  The remaining macros seem to be fairly
self-explanatory (unlike, say Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE
which are not documented in unicodeobject.h.)

Explicit downcasts would probably make sense, for example *(ptr)++ =
(Py_UNICODE)ch instead of *(ptr)++ = ch, but I don't think we need
explicit casts in, say, Py_UCS4 code = (ch) - 0x10000; where they can
mask coding errors.
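
The write side can be sketched as a function, under the assumption that uint16_t stands in for a narrow-build Py_UNICODE and uint32_t for Py_UCS4. The name put_next is hypothetical and the patch itself uses a macro, so this only illustrates where the explicit downcasts end up.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for a narrow-build Py_UNICODE */
typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

/* Write one code point at *pp and advance *pp by one or two units.
   Hypothetical function form of the "put next" idea. */
static void put_next(unit_t **pp, ucs4_t ch)
{
    assert(ch <= 0x10FFFF);
    if (ch < 0x10000) {
        /* BMP: the explicit downcast documents the intended narrowing. */
        *(*pp)++ = (unit_t)ch;
    }
    else {
        /* No cast on the subtraction: ch already has the wide type. */
        ucs4_t code = ch - 0x10000;
        *(*pp)++ = (unit_t)(0xD800 | (code >> 10));
        *(*pp)++ = (unit_t)(0xDC00 | (code & 0x03FF));
    }
}
```

The caller must provide room for two units per code point in the worst case.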

I also looked for the use of casts elsewhere in unicodeobject.h and
the following does not look right:

#define Py_UNICODE_ISSPACE(ch) \
((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))

It looks like this won't work right if ch is a signed char.
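
One way to look at the concern is to isolate the range test from the macro. The sketch below is an assumption-laden illustration, not CPython code: takes_ascii_branch is a made-up name, and it reproduces only the comparison against 128U for a value that came from a signed char.

```c
#include <assert.h>

/* ch holds the value of a signed char after integer promotion.
   Returns nonzero when the "(ch) < 128U" test would pick the table branch. */
static int takes_ascii_branch(int ch)
{
    assert(ch >= -128 && ch <= 127);
    /* The int operand is converted to unsigned int for the comparison,
       so a negative value turns into a very large unsigned number. */
    return ch < 128U;
}
```

With a negative input the table lookup is skipped and the negative value is handed to the non-ASCII branch, which is probably not what a caller passing a char expects.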

> * Same for your _Py_UNICODE_NEXT() to make sure that the return
>   value is indeed a Py_UNICODE value.


The return value of _Py_UNICODE_NEXT()  is *not* Py_UNICODE, it is
Py_UCS4 and as far as I can see, every conditional branch in narrow
case has an explicit cast.  In the wide case, I don't think we want an
explicit cast because ptr should already be Py_UCS4* and if it is not,
it may be a coding error that we don't want to mask.

> * In general, we should probably be clear on the allowed input
>   and define the output types in the documentation.

I agree.  I'll add a note that ptr and end should be Py_UNICODE*.  I
am not sure what we should say about ch argument.  If we add casts,
the macro will accept anything, but we should probably document it as
expecting Py_UCS4.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:24 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> Perhaps we should allow ord() to work on surrogates in
> UCS4 builds as well. That would reduce the number of
> surprises.


This is an interesting idea, however, having surrogates in UCS4 builds
will sooner or later lead to surprises such as

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed

I thought UCS4 (or more properly, UTF-32) did not allow encoding of
surrogate code points.

It is somewhat bothersome that a valid string literal such as
'\uD800\uDC00' in narrow build is subtly invalid in wide build.  It
would probably be better if  '\uD800\uDC00'  was either rejected on a
wide build, or interpreted as a single character so that

True

on any build.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

The example in my previous message should have been:

>>> '\U00010000' == '\uD800\uDC00'
True

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 11:36 AM, Georg Brandl rep...@bugs.python.org wrote:
..
> That bug already strikes me as quite exotic.

Would it look as exotic if presented like this?

  File "<stdin>", line 1
̀ = 5
   ^
SyntaxError: invalid character in identifier
(works on a wide build)

Note that with few exceptions, pretty much anything you can do with
supplementary characters will produce different results in wide and
narrow builds.  This includes all character type methods (isalpha,
isdigit, etc.), transformations such as case folding or normalization,
text formatting, etc, etc.

When I suggested on python-dev that supplementary character support on
narrow builds is not worth violating fundamental invariants such as
len(chr(i)) == 1, pretty much everyone said that Python should support
full Unicode regardless of build.  When it comes to fixing specific
differences between builds, I hear that these differences are not
important because no one is using supplementary characters.

This example is less exotic than say str.center() or str.swapcase()
not because it involves less exotic characters - all non-BMP
characters are exotic by definition - but because it involves the core
definition of the Python language.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I should stop using e-mail to reply to bug reports!  The mangled example was

 ̀ = 5
  File "<stdin>", line 1
̀ = 5
   ^
SyntaxError: invalid character in identifier

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


Added file: http://bugs.python.org/file20190/issue10542a.diff




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

On Wednesday, 29 December 2010 at 19:26 +0000, Alexander Belopolsky
wrote:
> Would it look as exotic if presented like this?
>
>   File "<stdin>", line 1
> ̀ = 5
>    ^
> SyntaxError: invalid character in identifier
> (works on a wide build)

Using non-ASCII identifiers is exotic. Using non-BMP identifiers is
crazy :-) Seriously, it can wait for 3.3.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 3:36 PM, STINNER Victor rep...@bugs.python.org wrote:
..
> Using non-ASCII identifiers is exotic. Using non-BMP identifiers is
> crazy :-)

Hmm, we clearly disagree on what crosses the boundary of the mental
norm.   IMHO, it is crazy to require users to care which plane their
characters come from or whether their programs will be run on a wide
or a narrow build.  I see nothing wrong with a desire to use
characters from say Mathematical Alphanumeric Symbols block if that
makes some Python expressions look more like the mathematical formulas
that they represent.  However, it is not about any particular usage,
but about the language definition.  I don't remember even a suggestion
during PEP 3131 discussion that non-BMP characters should be excluded
from identifiers wholesale.

In any case, can someone remind me what was the use case that
motivated chr(i) returning a two-character string for i > 0xFFFF?  I
think we should either stop pretending that narrow builds can handle
non-BMP characters and disallow them in Python strings or we should
try to fix the bugs associated with them.
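
To make the narrow-build behaviour concrete, here is a minimal sketch of how many 16-bit units one code point occupies. The function name narrow_units is hypothetical, and uint32_t stands in for Py_UCS4; it is an illustration of the arithmetic, not of how chr() is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 16-bit code units a narrow build needs for code point ch. */
static int narrow_units(uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    return ch > 0xFFFF ? 2 : 1;
}
```

Any code that assumes one unit per character therefore breaks exactly for the inputs above 0xFFFF.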

> Seriously, it can wait for 3.3.

What exactly can wait until 3.3?  The presented patch introduces no
user visible changes.  It is only a stepping stone to restoring some
sanity in the way supplementary characters are treated by narrow builds.
At the moment, it is a minefield: you can easily produce surrogate
pairs from string literals and codecs, but when you start using them,
you have a 50% chance that things will blow up, a 40% chance of getting
a wrong result and maybe a 10% chance that it will work.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

>> Seriously, it can wait for 3.3.
>
> What exactly can wait until 3.3?  The presented patch introduces no
> user visible changes.  It is only a stepping stone to restoring some
> sanity in the way supplementary characters are treated by narrow builds.
> At the moment, it is a minefield: you can easily produce surrogate
> pairs from string literals and codecs, but when you start using them,
> you have a 50% chance that things will blow up, a 40% chance of getting
> a wrong result and maybe a 10% chance that it will work.

I think the proposal is that fixing this minefield can wait until
Python 3.3 (or even 3.4, or later).

I plan to propose a complete redesign of the representation of Unicode
strings, which may well make this entire set of changes obsolete.

As for language definition: I think the definition is quite clear
and unambiguous. It may be that Python 3.2 doesn't fully implement it.

IOW: relax.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 8:02 PM, Martin v. Löwis rep...@bugs.python.org wrote:
..

> I plan to propose a complete redesign of the representation of Unicode
> strings, which may well make this entire set of changes obsolete.


Are you serious?  This sounds like a py4k idea.  Can you give us a
hint on what the new representation will be?  Meanwhile, what is your
recommendation for application developers?  Should they attempt to fix
the code that assumes len(chr(i)) == 1?  Should text processing
applications designed to run on a narrow build simply reject non-BMP
text? Should application writers avoid using str.isxyz() methods?

> As for language definition: I think the definition is quite clear
> and unambiguous. It may be that Python 3.2 doesn't fully implement it.


Given that until recently (r87433) the PEP and the reference manual
disagreed on the definition, I have to ask what definition you refer
to.  What Python 3.2 (or rather 3.1) implements, however is important
because it has been declared to be *the* definition of the Python
language regardless of what PEPs or docs have to say.

> IOW: relax.

This is the easy part. :-)

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 9:38 PM, Alexander Belopolsky
rep...@bugs.python.org wrote:
..
> Given that until recently (r87433) the PEP and the reference manual
> disagreed on the definition,

Actually, it looks like PEP 3131 and the Language Reference [1] still
disagree.  The latter says:

identifier  ::=  id_start id_continue*

which should probably be

identifier  ::=  xid_start xid_continue*

instead.
[1] http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-29 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> Are you serious?  This sounds like a py4k idea.  Can you give us a
> hint on what the new representation will be?

I'm thinking about an approach of a variable representation:
one, two, or four bytes, depending on the widest character that
appears in the string. I think it can be arranged to make this mostly
backwards-compatible with existing APIs, so it doesn't need to wait
for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.
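
The width selection described above can be sketched as follows. This is only an illustration of the idea as stated in this message, assuming the string is available as an array of uint32_t code points; bytes_per_char is a hypothetical name and says nothing about how an eventual implementation would store or scan strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the number of bytes per character (1, 2 or 4) needed to store
   every code point in s[0..n). Hypothetical helper for illustration. */
static int bytes_per_char(const uint32_t *s, size_t n)
{
    uint32_t widest = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        assert(s[i] <= 0x10FFFF);
        if (s[i] > widest)
            widest = s[i];
    }
    if (widest < 0x100)
        return 1;       /* every character fits in one byte */
    if (widest < 0x10000)
        return 2;       /* every character is in the BMP */
    return 4;           /* at least one supplementary character */
}
```

With such a scheme a string never contains surrogate pairs for supplementary characters, so each character is one array element whatever the width.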

> Meanwhile, what is your
> recommendation for application developers?  Should they attempt to fix
> the code that assumes len(chr(i)) == 1?  Should text processing
> applications designed to run on a narrow build simply reject non-BMP
> text? Should application writers avoid using str.isxyz() methods?

Given that this is vaporware: proceed as if that idea didn't exist.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> * this version should be slightly faster and is also easier to read:
>
> #define Py_UCS4_READ_CODE_POINT(ptr, end) \
..
>       Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
..
>   I haven't tested it, but you get the idea.

I don't think C guarantees the order of evaluation of the operands in
bitwise expressions such as the expansion of the JOIN_SURROGATES
macro.
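
A sketch of one way to sidestep the ordering question: advance the pointer in separate statements so that each increment is sequenced. It is written as a function over uint16_t units (a stand-in for a narrow-build Py_UNICODE) with the hypothetical name read_next; the patch itself uses macros, so this is an illustration of the technique only.

```c
#include <assert.h>
#include <stdint.h>

/* Read one code point from the units at *pp (bounded by end) and advance
   *pp past it. An unpaired surrogate is returned unchanged. */
static uint32_t read_next(const uint16_t **pp, const uint16_t *end)
{
    const uint16_t *p = *pp;
    uint32_t ch;
    assert(p < end);
    ch = *p++;                       /* first increment, own statement */
    if (ch >= 0xD800 && ch <= 0xDBFF &&
        p < end && *p >= 0xDC00 && *p <= 0xDFFF) {
        uint32_t low = *p++;         /* second increment, own statement */
        ch = (((ch & 0x03FF) << 10) | (low & 0x03FF)) + 0x10000;
    }
    *pp = p;
    return ch;
}
```

Because the high and low units are read in separate statements, the result does not depend on the order in which a compiler evaluates the operands of the joining expression.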

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-28 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I am attaching a patch for commit review.  I added an underscore prefix to all 
new macros.  This way I am not introducing new features and we will have a full 
release cycle to come up with better names.  I would just note that "next" 
terminology is consistent with PyDict_Next and _PySet_NextEntry.  The latter 
suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

--
assignee: lemburg -> belopolsky
stage: patch review -> commit review
Added file: http://bugs.python.org/file20186/issue10542.diff




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-16 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
nosy: +doerwalter




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-16 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 10, 2010 at 6:09 PM, Daniel Stutzbach
rep...@bugs.python.org wrote:
..
> The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless
> you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform
> characters < 0x10000 into characters >= 0x10000 or vice versa.
>
> Can we prove that will always be the case, for current and future versions
> of Unicode, for all or almost-all of the transformations we care about?

Certainly not for all, but for some important transformations, I
believe the Unicode Standard does promise that the transformation maps
BMP to BMP and supplements to supplements.  Case folding and
normalization are two important examples.

> Answering that question and figuring out what to do about it are probably
> more trouble than it's worth.
> If a particular point proves to be a bottleneck, we can always specialize
> the code there later.

Agree.  It is even more likely that the applications that have to deal
with lots of supplementary characters will be better off using a wide
unicode build rather than such specialization.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-10 Thread Daniel Stutzbach

Daniel Stutzbach stutzb...@google.com added the comment:

In bltinmodule.c, it looks like some of the indentation doesn't line up?

Bikeshedding aside, it looks good to me.

I agree with Eric Smith that the first part of the macro name usually refers to
the type of the first argument (or the type the first argument points to).
Examples:

 - Py_UNICODE_ISSPACE : Py_UNICODE -> int
 - Py_UNICODE_TOLOWER : Py_UNICODE -> Py_UNICODE
 - Py_UNICODE_strlen: Py_UNICODE * -> size_t

This is true elsewhere in the code as well:

 - PyList_GET_SIZE : PyListObject * -> Py_ssize_t

Yes, I know there are some unfortunate exceptions. ;-)

I agree that it would be nice if something in the name hinted that the return 
type was Py_UCS4 though.

Marc-Andre Lemburg wrote:
> The first argument of the macro can be any array, not just
> Py_UNICODE*, but also Py_UCS4* or even int*.

It's true that macros in C do not have any type safety.  While technically 
passing a Py_UCS4 * will work, on a UCS2 build it would needlessly check the 
Py_UCS4 data for surrogates.  I think we should discourage that.

You can also technically pass a PyListObject * to PyTuple_GET_SIZE, but that's 
also not a good idea. ;-)

Alexander Belopolsky wrote:
> The issue is that once in the process of reading the code point, it
> is determined whether the code point is BMP or non-BMP.  Testing the
> result again in order to write it is somewhat wasteful.  I don't
> think this would matter in practice, but would like to hear
> alternative opinions before moving further.

If the common pattern is:

 ch = Py_UNICODE_NEXT(rp, end);
 uc = Py_UNICODE_SOME_TRANSFORMATION(ch);
 Py_UNICODE_PUT_NEXT(wp, uc);

The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you 
can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters < 
0x10000 into characters >= 0x10000 or vice versa.

Can we prove that will always be the case, for current and future versions of 
Unicode, for all or almost-all of the transformations we care about?

Answering that question and figuring out what to do about it are probably more 
trouble than it's worth.  If a particular point proves to be a bottleneck, we 
can always specialize the code there later.
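
The read, transform, write pattern above can be sketched as a small loop. Everything here is an assumption made for illustration: uint16_t stands in for a narrow-build Py_UNICODE, uint32_t for Py_UCS4, and read_next, put_next, transform_units and toy_transform are hypothetical names rather than anything from the patch. The toy transformation deliberately maps a BMP character to a supplementary one, which is the case where the second surrogate check matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one code point and advance *rp; unpaired surrogates pass through. */
static uint32_t read_next(const uint16_t **rp, const uint16_t *end)
{
    uint32_t ch;
    assert(*rp < end);
    ch = *(*rp)++;
    if (ch >= 0xD800 && ch <= 0xDBFF &&
        *rp < end && **rp >= 0xDC00 && **rp <= 0xDFFF) {
        uint32_t low = *(*rp)++;
        ch = (((ch & 0x03FF) << 10) | (low & 0x03FF)) + 0x10000;
    }
    return ch;
}

/* Write one code point and advance *wp by one or two units. */
static void put_next(uint16_t **wp, uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    if (ch < 0x10000) {
        *(*wp)++ = (uint16_t)ch;
    }
    else {
        ch -= 0x10000;
        *(*wp)++ = (uint16_t)(0xD800 | (ch >> 10));
        *(*wp)++ = (uint16_t)(0xDC00 | (ch & 0x03FF));
    }
}

typedef uint32_t (*transform_fn)(uint32_t);

/* Apply f to every code point in src[0..n) and write the result to dst,
   which must have room for 2*n units. Returns the units written. */
static size_t transform_units(const uint16_t *src, size_t n,
                              uint16_t *dst, transform_fn f)
{
    const uint16_t *rp = src;
    const uint16_t *end = src + n;
    uint16_t *wp = dst;
    while (rp < end) {
        uint32_t ch = read_next(&rp, end);
        uint32_t uc = f(ch);
        put_next(&wp, uc);   /* re-tests uc: f may cross the 0x10000 boundary */
    }
    return (size_t)(wp - dst);
}

/* Toy transformation: maps U+0041 to U+1D400, leaves the rest unchanged. */
static uint32_t toy_transform(uint32_t ch)
{
    return ch == 0x41 ? 0x1D400 : ch;
}
```

Two input units become three output units here, so a writer that reused the reader's BMP decision would have produced a truncated result.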

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-07 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Daniel,

While these macros should not affect ABI, I would appreciate your feedback in 
light of your work on issue 8654.

--
nosy: +stutzbach

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-07 Thread Daniel Stutzbach

Daniel Stutzbach stutzb...@google.com added the comment:

+1 on the general idea of abstracting out repeated code.

I will take a closer look at the details within the next few days.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-03 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger
rep...@bugs.python.org wrote:
..
> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator
> protocol is being used.


As a data point, ICU defines U16_NEXT() for a similar purpose.  I also
like ICU terminology for surrogates ("lead" and "trail") better than
the backward "high" and "low".  U16_APPEND() suggests
Py_UNICODE_APPEND instead of PUT_NEXT (this one has the virtue of not
having "next" in the name as well).  I still like NEXT better than
ADVANCE because it is shorter and has an obvious PREV counterpart that
we may want to add later.

Note that ICU uses U16_ prefix for these macros even when they operate
on 32-bit characters.

More at

http://icu-project.org/apiref/icu4c/utf16_8h.html
http://userguide.icu-project.org/strings

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-12-03 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
 
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger
> rep...@bugs.python.org wrote:
> ..
>> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator
>> protocol is being used.
>
> As a data point, ICU defines U16_NEXT() for similar purpose.  I also
> like ICU terminology for surrogates ("lead" and "trail") better than
> the backward "high" and "low".

High and low are Unicode standard terms, so we should use
those.

Regarding Py_UCS4_READ_CODE_POINT: you're right that surrogates
are code points, so how about Py_UCS4_READ_NEXT() ?!

Regarding Py_UCS4_READ_NEXT() vs. Py_UNICODE_READ_NEXT(): the return
value of the macro is a Py_UCS4 value, not a Py_UNICODE value. The
first argument of the macro can be any array, not just Py_UNICODE*,
but also Py_UCS4* or even int*.

Py_UCS2_READ_NEXT() would be plain wrong :-) Also note that Python
does have a Py_UCS4 type; it doesn't have a Py_UCS2 type.

That's why we should use *Py_UCS4*_READ_NEXT().

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Raymond Hettinger wrote:
 
> Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
>
> Mark, can you opine on this?

Yes, I'll have a look later today.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

I like the idea and thanks for putting work into this.

Some comments:
 
 * when using macro variables, always put the variables in parens in the 
expansion; this avoids precedence issues, weird syntax errors, etc. - even if 
it may not be necessary
 
 * a function would be cleaner, but since this code is very performance 
sensitive, I'd opt for the macro version, unless someone can prove that a 
function would be just as fast in benchmarks
 
 * the macros should be documented in the unicodeobject.h header file and 
clearly mention that ptr and end should be side-effect free and that ptr must 
be an lvalue
 
 * please use the faster bitmask operators for joining surrogates, i.e.
ucs4 = ((((high & 0x03FF) << 10) | (low & 0x03FF)) + 0x00010000);

 * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it 
returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()

 * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
 
 * in order to make the macro easier to understand, please rename it to 
Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less 
than without the macro :-)

 * this version should be slightly faster and is also easier to read:
 
#define Py_UCS4_READ_CODE_POINT(ptr, end) \
((Py_UNICODE_ISHIGHSURROGATE((ptr)[0]) && \
  (ptr) < (end) && \
  Py_UNICODE_ISLOWSURROGATE((ptr)[1])) ? \
  Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
  (Py_UCS4)*(ptr)++)

   I haven't tested it, but you get the idea.
   
BTW: You only focus on UCS2 builds. Please also make sure that these changes 
work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 
builds and join code points there.

Note that UCS4 builds currently don't join surrogates, so a high and a low 
surrogate appear as two code points, which they are, but given the experience 
with UCS2 builds, may not be what the user expects. So for the purpose of 
consistency we should be careful with auto-joining surrogates in UCS2.

It does make sense for ord() and the various string methods, but should be done 
with care in other cases.

In any case, we should clearly document where these macros are used and warn 
about the implications of using them in the wrong places.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
.. [I'll respond to skipped when I update the patch]

> In any case, we should clearly document where these macros are used and
> warn about the implications of using them in the wrong places.

It may be best to start with _Py_UCS2_READ_CODE_POINT() (BTW, I like
the name because it naturally leads to a Py_UCS2_WRITE_CODE_POINT()
counterpart.)  The leading underscore will probably not stop early
adopters from using it and we may get some user feedback if they ask
to make these macros public.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg
rep...@bugs.python.org wrote:
..
> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>
> * in order to make the macro easier to understand, please rename it to
>   Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still
>   a lot less than without the macro :-)

I am not sure Py_UCS4_ prefix is right here.  (I agree on *SURROGATE*
methods.)  The point of Py_UNICODE_NEXT(ptr, end) is that the pointers
ptr and end are Py_UNICODE* and the macro expands to *p++ on wide
builds.  Maybe Py_UNICODE_NEXT_USC4?

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
>
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg
> rep...@bugs.python.org wrote:
> ..
>> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>>
>> * in order to make the macro easier to understand, please rename it to
>>   Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still
>>   a lot less than without the macro :-)
>
> I am not sure Py_UCS4_ prefix is right here.  (I agree on *SURROGATE*
> methods.)  The point of Py_UNICODE_NEXT(ptr, end) is that the pointers
> ptr and end are Py_UNICODE* and the macro expands to *ptr++ on wide
> builds.  Maybe Py_UNICODE_NEXT_UCS4?

The idea is that the first part refers to what the macro returns (Py_UCS4)
and the read part of the name refers to moving a pointer across
an array (any array of integers).

Note that the macro can also work on Py_UCS4 arrays (even in
UCS2 builds), so it's universal in that respect.

Perhaps we should allow ord() to work on surrogates in
UCS4 builds as well. That would reduce the number of
surprises.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

> The idea is that the first part refers to what the macro
> returns (Py_UCS4) and the read part of the name refers
> to moving a pointer across an array (any array of integers).

I thought the first part generally meant the type of the first parameter. 
Although I can go either way, especially if we add an underscore.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

> * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since
> it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

I'm not so familiar with the prefix conventions, but wouldn't that lead users 
to think that this macro is for wide builds and that they have to use Py_UCS2_* 
macros for narrow builds? If these macros are supposed to abstract the build 
type maybe they should have a neutral prefix. (But if the conventions we use 
say otherwise I guess the best we can do is to document it properly).
 
> * in order to make the macro easier to understand, please rename it to
> Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less
> than without the macro :-)

The term code point is not entirely correct here. High and low surrogates are 
code points too. The right term should be 'scalar value' (but that might be 
confusing). The 'READ' bit sounds fine though, maybe 'READ_NEXT'?

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:41 PM, Ezio Melotti rep...@bugs.python.org wrote:

> Ezio Melotti ezio.melo...@gmail.com added the comment:
>
>> * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since
>> it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
>> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>
> I'm not so familiar with the prefix conventions, but wouldn't that lead users
> to think that this macro is for wide builds and that they have to use
> Py_UCS2_* macros for narrow builds? If these macros are supposed to abstract
> the build type maybe they should have a neutral prefix. (But if the
> conventions we use say otherwise I guess the best we can do is to document it
> properly).

When I was using the name, I did not think about argument type.
Py_UNICODE_ is just the namespace prefix used by all macros in
unicodeobject.h. Case in point: Py_UNICODE_ISALPHA() and family that
take Py_UCS4.  (I know, there is a historical reason at work here, but
why fight it?)

Functions use PyUnicode_ prefix and build specific functions use
PyUnicodeUCSx_ prefix.   As far as I can tell, there are no macros
with Py_UCS4_ prefix.  The choices I like in the order of preference
are

1. Py_UNICODE_NEXT
2. Py_UNICODE_NEXT_UCS4
3. Py_UNICODE_READ_NEXT_UCS4

I can live with anything else, though.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

I suggest Py_UNICODE_ADVANCE() to avoid the false suggestion that the iterator 
protocol is being used.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the
> iterator protocol is being used.

You can't use the iterator protocol on a non-PyObject, and Py_UNICODE_*
(as opposed to PyUnicode_*) suggests the macro operates on a raw array
of code points.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

AFAIU the macro returns lone surrogates as they are, this means that:
  1) if the string contains only surrogate pairs, Py_UNICODE_NEXT will iterate 
on scalar values[0];
  2) if the string contains only lone surrogates, it will iterate on 
codepoints[1];
  3) if it contains both it will be half and half (i.e. scalar values if the 
surrogates are in pair, or falling back on codepoints if they aren't);
(for strings without surrogates, iterating on scalar values or codepoints is 
the same).

Are these semantics correct for all (or at least most of) the places where the 
macro will be used?
Would a stricter version (that rejects lone surrogates and iterates on scalar 
values only) be useful in addition to, or as an alternative to, Py_UNICODE_NEXT?

[0]: http://unicode.org/glossary/#unicode_scalar_value
[1]: http://unicode.org/glossary/#code_point
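
To make the "stricter version" question concrete, here is a minimal sketch of
a reader that yields scalar values only and reports lone surrogates as errors.
Nothing here comes from the attached patch: unit_t, ucs4_t and next_scalar are
hypothetical names (unit_t stands in for a 16-bit Py_UNICODE, ucs4_t for
Py_UCS4), and returning NULL on a lone surrogate is just one possible error
convention.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for a 16-bit Py_UNICODE */
typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

/* Read one scalar value starting at ptr (requires ptr != end).
   On success store it in *out and return the advanced pointer.
   Return NULL if ptr points at a lone surrogate. */
static const unit_t *
next_scalar(const unit_t *ptr, const unit_t *end, ucs4_t *out)
{
    unit_t first = *ptr++;

    if (first < 0xD800 || first > 0xDFFF) {
        /* not a surrogate: the unit is the scalar value */
        *out = first;
        return ptr;
    }
    /* a surrogate is accepted only as high followed by low */
    if (first <= 0xDBFF && ptr != end && 0xDC00 <= *ptr && *ptr <= 0xDFFF) {
        *out = ((((ucs4_t)first & 0x3FF) << 10)
                | ((ucs4_t)*ptr & 0x3FF)) + 0x10000;
        return ptr + 1;
    }
    return NULL;
}
```

A caller that wants the lenient behaviour described in points 1) to 3) could
fall back to returning the raw unit whenever this function reports NULL.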

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-27 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I am attaching a patch that defines a Py_UNICODE_PUT_NEXT() macro (tentative 
name) and uses it to fix the str.upper method.  The implementation of a 
surrogate-aware str.upper shows that the NEXT/PUT_NEXT abstractions may lead to 
somewhat inefficient code for codepoint-by-codepoint processing.  The issue is 
that, in the process of reading the codepoint, it is already determined whether 
the code point is BMP or non-BMP.  Testing the result again in order to write it 
is somewhat wasteful.  I don't think this would matter in practice, but would 
like to hear alternative opinions before moving further. (Please, don't argue 
over names - let's figure out the proper semantics first.)
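
The write side can be sketched as follows. This is not the code from the
attached diff: unit_t, ucs4_t and put_next are hypothetical names (unit_t
models a 16-bit Py_UNICODE), the sketch is a function rather than a macro, and
it assumes the caller has reserved room for two units and that ch is at most
0x10FFFF.

```c
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for a 16-bit Py_UNICODE */
typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

/* Store ch at ptr as one unit (BMP) or as a surrogate pair (non-BMP)
   and return the advanced pointer. */
static unit_t *
put_next(unit_t *ptr, ucs4_t ch)
{
    if (ch < 0x10000) {
        *ptr++ = (unit_t)ch;
    }
    else {
        ch -= 0x10000;
        *ptr++ = (unit_t)(0xD800 | (ch >> 10));    /* high surrogate */
        *ptr++ = (unit_t)(0xDC00 | (ch & 0x3FF));  /* low surrogate */
    }
    return ptr;
}
```

The `ch < 0x10000` branch is the repeated BMP/non-BMP test mentioned above: a
reader that has just joined a surrogate pair already knows the answer.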

--
Added file: http://bugs.python.org/file19845/issue10542-put-next.diff




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

New submission from Alexander Belopolsky belopol...@users.sourceforge.net:

As discussed in issue 10521 and the sprawling len(chr(i)) = 2? thread [1] on 
python-dev, many functions in python library behave differently on narrow and 
wide builds.  While there are unavoidable differences such as the length of 
strings with non-BMP characters, many functions can work around these 
differences.  For example, the ord() function already produces integers over 
0xFFFF when given a surrogate pair as a string of length two on a narrow build. 
 Other functions, such as str.isalpha(), are not yet aware of surrogates.  See 
also issue9200.

A consensus is developing that non-BMP characters support on narrow builds is 
here to stay and that naive functions should be fixed.  Unfortunately, working 
with surrogates in python code is tricky because unicode C-API does not provide 
much support and existing examples of surrogate processing look like this:

-        while (u != uend && w != wend) {
-            if (0xD800 <= u[0] && u[0] <= 0xDBFF
-                && 0xDC00 <= u[1] && u[1] <= 0xDFFF)
-            {
-                *w = (((u[0] & 0x3FF) << 10) | (u[1] & 0x3FF)) + 0x10000;
-                u += 2;
-            }
-            else {
-                *w = *u;
-                u++;
-            }
-            w++;
-        }

The attached patch introduces a Py_UNICODE_NEXT() macro that allows replacing 
the code above with two lines:

+        while (u != uend && w != wend)
+            *w++ = Py_UNICODE_NEXT(u, uend);

The patch also introduces a set of macros for manipulating the surrogates, but 
I have not started replacing more instances of verbose surrogate processing 
because I would like to first look for higher level abstractions such as 
Py_UNICODE_NEXT().  For example, there are many instances that can benefit from 
a Py_UNICODE_PUT_NEXT(ptr, ch) macro that would put a UCS4 character ch into the 
Py_UNICODE buffer pointed to by ptr and advance ptr by 1 or 2 units as necessary.
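
For readers without the patch at hand, a NEXT-style macro for a narrow build
could look roughly like the sketch below. It is an illustration, not the
contents of unicode-next.diff: unit_t, ucs4_t, IS_HIGH_SURROGATE,
IS_LOW_SURROGATE, JOIN_SURROGATES, NEXT and widen are all hypothetical names,
and the sketch assumes that lone surrogates are passed through unchanged and
that the macro never reads at or past end.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for Py_UNICODE on a narrow build */
typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

#define IS_HIGH_SURROGATE(ch) (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define IS_LOW_SURROGATE(ch)  (0xDC00 <= (ch) && (ch) <= 0xDFFF)
#define JOIN_SURROGATES(high, low) \
    (((((ucs4_t)(high) & 0x03FF) << 10) | ((ucs4_t)(low) & 0x03FF)) + 0x10000)

/* Read one code point at ptr and advance ptr by one or two units.
   A surrogate pair is joined; anything else is returned as is. */
#define NEXT(ptr, end)                                          \
    ((IS_HIGH_SURROGATE((ptr)[0]) && (ptr) + 1 < (end)          \
      && IS_LOW_SURROGATE((ptr)[1]))                            \
         ? ((ptr) += 2, JOIN_SURROGATES((ptr)[-2], (ptr)[-1]))  \
         : (ucs4_t)*(ptr)++)

/* Copy [u, uend) into [w, wend) one code point at a time and return
   the number of code points written. */
static size_t
widen(const unit_t *u, const unit_t *uend, ucs4_t *w, ucs4_t *wend)
{
    ucs4_t *start = w;

    while (u != uend && w != wend)
        *w++ = NEXT(u, uend);
    return (size_t)(w - start);
}
```

On a wide build the same macro would reduce to `*(ptr)++`, which is what makes
the two-line loop build-independent.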


[1] http://mail.python.org/pipermail/python-dev/2010-November/105908.html

--
assignee: belopolsky
components: Extension Modules, Interpreter Core, Unicode
files: unicode-next.diff
keywords: patch
messages: 122464
nosy: Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti, 
lemburg, pitrou
priority: normal
severity: normal
stage: patch review
status: open
title: Py_UNICODE_NEXT and other macros for surrogates
type: feature request
versions: Python 3.2
Added file: http://bugs.python.org/file19825/unicode-next.diff




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
nosy: +haypo, loewis




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUT_NEXT,  
str.__format__ would also need a function that tells it how many Py_UNICODEs 
are needed to store a given Py_UCS4.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Nov 26, 2010 at 7:27 PM, Eric Smith rep...@bugs.python.org wrote:
..

> In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUT_NEXT,
> str.__format__ would also need a function that tells it how many Py_UNICODEs
> are needed to store a given Py_UCS4.

Yes, this functionality is currently hidden in

unicode_aswidechar(PyUnicodeObject *unicode,
                   wchar_t *w,
                   Py_ssize_t size):

/* Helper function for PyUnicode_AsWideChar() and
   PyUnicode_AsWideCharString():
   convert a Unicode object to a wide character string.

   - If w is NULL: return the number of wide characters (including the nul
     character) required to convert the unicode object. Ignore size argument.
.. */

and I believe is reimplemented in a few other places.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

I'd need access to this without having to build a PyUnicodeObject, for 
efficiency. But it sounds like it does have the basic functionality I need.

For my use I'd really need it to take the result of Py_UNICODE_NEXT. Something 
like:
Py_ssize_t
Py_UNICODE_NUM_NEEDED(Py_UCS4 c)
and it would always return 1 or 2. Always 1 for a wide build, and for a narrow 
build 1 if c is in the BMP else 2. Choose a better name, of course.
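
A minimal sketch of such a primitive, under stated assumptions:
SKETCH_NARROW_BUILD, ucs4_t, units_needed and fill_space are hypothetical names
invented for this illustration (the real code would key off the interpreter's
own build configuration and use Py_UCS4 and Py_ssize_t).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

/* Hypothetical build switch: 1 models a narrow (16-bit unit) build,
   0 models a wide build. */
#ifndef SKETCH_NARROW_BUILD
#define SKETCH_NARROW_BUILD 1
#endif

/* Number of storage units needed for one code point: always 1 on a wide
   build; on a narrow build 1 inside the BMP and 2 outside it. */
static ptrdiff_t
units_needed(ucs4_t c)
{
#if SKETCH_NARROW_BUILD
    return c > 0xFFFF ? 2 : 1;
#else
    (void)c;
    return 1;
#endif
}

/* Space to reserve when padding with count copies of fill. */
static ptrdiff_t
fill_space(ucs4_t fill, ptrdiff_t count)
{
    return units_needed(fill) * count;
}
```

Because it takes the code point itself, this works directly on the result of a
NEXT-style read without keeping pointers into the original buffer.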

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Nov 26, 2010 at 7:45 PM, Eric Smith rep...@bugs.python.org wrote:
..
> For my use I'd really need it to take the result of Py_UNICODE_NEXT.
> Something like:
> Py_ssize_t
> Py_UNICODE_NUM_NEEDED(Py_UCS4 c)
> and it would always return 1 or 2. Always 1 for a wide build, and for a narrow
> build 1 if c is in the BMP else 2. Choose a better name, of course.

Can you describe your use case in more detail?  Would
Py_UNICODE_PUT_NEXT() combined with
Py_UNICODE_CODEPOINT_COUNT(Py_UNICODE *begin, Py_UNICODE *end) solve it?
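
As a rough idea of what a (begin, end) counting helper might do on a narrow
build, here is a sketch. The names unit_t and codepoint_count are hypothetical,
and counting each lone surrogate as one code point is an assumption of this
sketch, not something settled in this thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for a 16-bit Py_UNICODE */

/* Count code points in [begin, end): a high surrogate immediately
   followed by a low surrogate counts once, every other unit counts once. */
static ptrdiff_t
codepoint_count(const unit_t *begin, const unit_t *end)
{
    ptrdiff_t count = 0;

    while (begin != end) {
        if (0xD800 <= begin[0] && begin[0] <= 0xDBFF
            && begin + 1 != end
            && 0xDC00 <= begin[1] && begin[1] <= 0xDFFF)
            begin += 2;     /* well-formed surrogate pair */
        else
            begin += 1;     /* BMP unit or lone surrogate */
        count++;
    }
    return count;
}
```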

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I don't like macros that have a result and use multiple instructions via the 
evil magic trick (the "," operator). Such code is harder to maintain and harder 
to debug than a classical function.

Don't you think that modern compilers are able to inline the code? (If not, we 
may add the right C attribute/keyword)

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

The code will basically be:

  Py_UCS4 fill;

  parse_format_string(fmt, ..., fill, ...);

  /* lots more code */

  if (fill_needed) {
      /* compute how many characters to reserve */
      space_needed = Py_UNICODE_NUM_NEEDED(fill) *
                     number_of_characters_to_fill;
  }

It would be most convenient (and require the fewest changes) if the computation 
could just use fill, instead of remembering the pointers to the beginning and 
end of fill.

Py_UNICODE_CODEPOINT_COUNT could be implemented with a primitive that does what 
I want.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Nov 26, 2010 at 8:41 PM, STINNER Victor rep...@bugs.python.org wrote:
..
> I don't like macro having a result and using multiple instructions using the
> evil magic trick (the ,). It's harder to maintain the code and harder to
> debug than a classical function.


You are preaching to the choir.  In fact, my first version
(issue10521-unicode-next.diff attached to  issue10521) used a
function.   I would not worry about implementation at this point,
though.  Let's find the best abstraction first.

> Don't you think that modern compilers are able to inline the code?
> (If not, we may add the right C attribute/keyword)

Not in C.  In C++, I could use a reference to the pointer incremented
by the macro, but in C, I have to use an address.  Once you take an
address of a variable, the compiler will refuse to put it in a
register.  So no, I don't think we can write an ANSI C function that
will be as efficient as the macro.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

The compiler's decision to inline something should not be related to its 
ability to put variables in a register.

But I definitely agree that we should get the abstraction right first and worry 
about the implementation later.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10542
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Nov 26, 2010 at 9:22 PM, Eric Smith rep...@bugs.python.org wrote:
..
> But I definitely agree that we should get the abstraction right first and
> worry about the implementation later.

I am fairly happy with the Py_UNICODE_NEXT() abstraction.  Its semantics
should be natural for users familiar with python iterators and the
fact that it expands to simply *ptr++ on wide builds makes it easy to
explain its usage.   I am not very happy about the end argument for
the following reasons:

1. Builtin next() takes the default value as a second argument.
Extension writers may expect the same from Py_UNICODE_NEXT().  The
name end should be self-explanatory though, especially to those
with an exposure to STL.

2. If  Py_UNICODE_NEXT() stays as a macro, an innocent looking
Py_UNICODE_NEXT(p, p + size) will have a hard to detect bug.  Can be
fixed by making Py_UNICODE_NEXT() a function.
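
The hazard in point 2 can be reproduced with a deliberately constructed
example. The macro below is hypothetical and is written so that it advances ptr
before it evaluates end; the macro in the patch may be structured differently.
All names (unit_t, ucs4_t, IS_HIGH, IS_LOW, JOIN, NEXT_MACRO, next_func and the
two read_first helpers) are invented for this sketch.

```c
#include <stdint.h>

typedef uint16_t unit_t;   /* stand-in for a 16-bit Py_UNICODE */
typedef uint32_t ucs4_t;   /* stand-in for Py_UCS4 */

#define IS_HIGH(ch) (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define IS_LOW(ch)  (0xDC00 <= (ch) && (ch) <= 0xDFFF)
#define JOIN(hi, lo) \
    (((((ucs4_t)(hi) & 0x3FF) << 10) | ((ucs4_t)(lo) & 0x3FF)) + 0x10000)

/* Hypothetical macro: ptr is advanced first, end is evaluated afterwards,
   so an end expression that mentions ptr sees the already advanced value. */
#define NEXT_MACRO(ptr, end)                                    \
    (++(ptr),                                                   \
     (IS_HIGH((ptr)[-1]) && (ptr) < (end) && IS_LOW((ptr)[0]))  \
         ? (++(ptr), JOIN((ptr)[-2], (ptr)[-1]))                \
         : (ucs4_t)(ptr)[-1])

/* Function form: end is evaluated once, at the call site. */
static ucs4_t
next_func(const unit_t **pp, const unit_t *end)
{
    const unit_t *p = *pp;
    ucs4_t ch = *p++;

    if (IS_HIGH(ch) && p < end && IS_LOW(*p)) {
        ch = JOIN(ch, *p);
        p++;
    }
    *pp = p;
    return ch;
}

/* Read the first code point of a buffer whose logical length is size. */
static ucs4_t
read_first_macro(const unit_t *p, int size)
{
    return NEXT_MACRO(p, p + size);   /* end moves together with p */
}

static ucs4_t
read_first_func(const unit_t *p, int size)
{
    return next_func(&p, p + size);   /* end is fixed before p moves */
}
```

With a two-unit array holding a surrogate pair and size set to 1, the macro
form joins the pair by reading one unit past the logical end, while the
function form returns the lone high surrogate.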

I wonder whether it is best to prefix the new macros with an
underscore.  On one hand, we want to make this available to extension
writers, on the other hand, once more people start dealing with
non-BMP issues, a better abstraction may be found and we may not want
to maintain  Py_UNICODE_NEXT()  indefinitely.

--




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Raymond,

I wonder if you would like to comment on the iterator analogy and/or on adding 
public names to C API.

--
nosy: +rhettinger




[issue10542] Py_UNICODE_NEXT and other macros for surrogates

2010-11-26 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

Mark, can you opine on this?

--
assignee: belopolsky - lemburg
