[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Benjamin Peterson benja...@python.org added the comment: Closing now. -- nosy: +benjamin.peterson resolution: - out of date status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue10542 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

PEP 393 has been accepted and merged into Python 3.3. Python 3.3 doesn't need the Py_UNICODE_NEXT macro anymore. But my macros (unicode_macros.patch) are still useful.

--
versions: +Python 3.2 -Python 3.3
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

Py_UNICODE_NEXT has been removed from 3.3 but it's still available and used in 2.7/3.2 (even if it's private). In order to fix #10521 on 2.7/3.2 the _Py_UNICODE_PUT_NEXT macro attached to this patch is required.

--
versions: +Python 3.3 -Python 3.2
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

The attached patch adds the following 4 public macros to unicodeobject.h:

    Py_UNICODE_IS_SURROGATE(ch)
    Py_UNICODE_IS_HIGH_SURROGATE(ch)
    Py_UNICODE_IS_LOW_SURROGATE(ch)
    Py_UNICODE_JOIN_SURROGATES(high, low)

and documents them. Since _Py_UNICODE_NEXT is still private, I'll commit it later as part of #9200.

--
Added file: http://bugs.python.org/file23000/issue10542b.diff
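For readers who have not opened the patch, the four macros named above can be pictured roughly as follows. This is a minimal sketch, not the text of issue10542b.diff: the `SKETCH_` names and the `sketch_join_checked` wrapper are hypothetical, and the only facts assumed are the standard UTF-16 surrogate ranges (D800..DBFF for high, DC00..DFFF for low).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names (SKETCH_ prefix). The ranges are the standard UTF-16
   surrogate ranges; nothing here is copied from the patch. */
#define SKETCH_IS_SURROGATE(ch)      (0xD800 <= (ch) && (ch) <= 0xDFFF)
#define SKETCH_IS_HIGH_SURROGATE(ch) (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define SKETCH_IS_LOW_SURROGATE(ch)  (0xDC00 <= (ch) && (ch) <= 0xDFFF)

/* Combine a high/low pair into one code point in U+10000..U+10FFFF. */
#define SKETCH_JOIN_SURROGATES(high, low)            \
    (((((uint32_t)(high) & 0x03FF) << 10) |          \
      ((uint32_t)(low) & 0x03FF)) + 0x10000)

/* Function wrapper (an assumption for illustration) that checks its
   preconditions before joining. */
static uint32_t sketch_join_checked(uint16_t high, uint16_t low)
{
    assert(SKETCH_IS_HIGH_SURROGATE(high));
    assert(SKETCH_IS_LOW_SURROGATE(low));
    return SKETCH_JOIN_SURROGATES(high, low);
}
```

The join macro does no validation itself, which is why the wrapper asserts the ranges first; whether the real macros validate their arguments is not stated in this thread.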
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
> The attached patch adds the following 4 public macros to unicodeobject.h:
>     Py_UNICODE_IS_SURROGATE(ch)
>     Py_UNICODE_IS_HIGH_SURROGATE(ch)
>     Py_UNICODE_IS_LOW_SURROGATE(ch)
>     Py_UNICODE_JOIN_SURROGATES(high, low)
> and documents them. Since _Py_UNICODE_NEXT is still private, I'll commit it later as part of #9200.

Looks good.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset 77171f993bf2 by Ezio Melotti in branch 'default':
#10542: Add 4 macros to work with surrogates: Py_UNICODE_IS_SURROGATE, Py_UNICODE_IS_HIGH_SURROGATE, Py_UNICODE_IS_LOW_SURROGATE, Py_UNICODE_JOIN_SURROGATES.
http://hg.python.org/cpython/rev/77171f993bf2

--
nosy: +python-dev
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

I attached a patch to fix the str.is* methods on #9200 that also includes the macro. Since they are not public there, I don't see a reason to do 2 separate commits on 2.7/3.2 (one for the feature and one for the fix).
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

On 17/08/2011 07:04, Ezio Melotti wrote:
> As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2. They should go in unicodeobject.h

For Python 2.7 and 3.2, I would prefer not to touch a public header, and so add the macros in unicodeobject.c.

> and be public in 3.3+.

If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will have to subtract 0x10000 themselves (whereas my macros require the ordinal to be preprocessed).

> * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;

Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

> * IS_NONBMP doesn't simplify much the code but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;

If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

> * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.

They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.

> Unless someone disagrees I'll prepare a patch with PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).

Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES).

I used the verb combine, taken from a comment in unicodeobject.c. combine is maybe better than join?
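The point about HIGH_SURROGATE/LOW_SURROGATE subtracting 0x10000 themselves, and about an IS_BMP check, can be sketched as below. This is only an illustration under stated assumptions: the `SKETCH_` macro names and `sketch_put_utf16` are hypothetical, and the code is not taken from unicodeobject.c or from unicode_macros.patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; names are illustrative, not CPython's. */
#define SKETCH_IS_BMP(cp) ((uint32_t)(cp) <= 0xFFFF)

/* Both macros subtract 0x10000 themselves, so the caller passes the raw
   code point rather than a preprocessed ordinal. */
#define SKETCH_HIGH_SURROGATE(cp) \
    ((uint16_t)(0xD800 | ((((uint32_t)(cp) - 0x10000) >> 10) & 0x3FF)))
#define SKETCH_LOW_SURROGATE(cp) \
    ((uint16_t)(0xDC00 | (((uint32_t)(cp) - 0x10000) & 0x3FF)))

/* Write cp as UTF-16 code units into out (room for 2 units);
   return the number of units written. */
static int sketch_put_utf16(uint32_t cp, uint16_t *out)
{
    assert(cp <= 0x10FFFF);
    if (SKETCH_IS_BMP(cp)) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    out[0] = SKETCH_HIGH_SURROGATE(cp);
    out[1] = SKETCH_LOW_SURROGATE(cp);
    return 2;
}
```

Checking for BMP rather than non-BMP keeps the common single-unit case in the first branch, which also avoids the double negation raised later in the thread.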
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> On 17/08/2011 07:04, Ezio Melotti wrote:
>> As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2. They should go in unicodeobject.h

Ezio used two different naming schemes in his email. Please always use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).

> For Python 2.7 and 3.2, I would prefer not to touch a public header, and so add the macros in unicodeobject.c.

Why would you want to touch Python 2.7 at all?

>> and be public in 3.3+.
>
> If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will have to subtract 0x10000 themselves (whereas my macros require the ordinal to be preprocessed).

This can be done by having two definitions of the macros: one set for UCS2 builds and one for UCS4.

>> * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;
>
> Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

Certainly not into Python 2.7. Adding macros in patch level releases is also not such a good idea.

>> * IS_NONBMP doesn't simplify much the code but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;
>
> If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

Py_UNICODE_IS_BMP() please.

>> * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.
>
> They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.
>
>> Unless someone disagrees I'll prepare a patch with PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).
>
> Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES).
>
> I used the verb combine, taken from a comment in unicodeobject.c. combine is maybe better than join?

No, Py_UNICODE_... please!

Thanks,
--
Marc-Andre Lemburg
eGenix.com
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

> For Python 2.7 and 3.2, I would prefer not to touch a public header, and so add the macros in unicodeobject.c.

Is there some reason for this? I think it's better if we have them in the same place rather than renaming and moving them to another file between 3.2 and 3.3.

> If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will have to subtract 0x10000 themselves (whereas my macros require the ordinal to be preprocessed).

If they turn out to be useful and we find a clearer name we can even make them public in 3.3, but we'll have to see about that.

> Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

If they don't, it won't be possible to fix #9200 in those branches (unless we decide that the bug shouldn't be fixed there, but I would rather fix it).

> If you want to make it public, it's better to call it PyUNICODE_IS_BMP() (check if the argument is in U+0000-U+FFFF).

Yes, public APIs will follow the naming conventions. Not sure if it's better to check if it's a BMP char, or if it's not.

> They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in unicodeobject.c.

What is the naming convention for private macros in the same .c file where they are used? Shouldn't they get at least a trailing _?

>> Unless someone disagrees I'll prepare a patch with PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).
>
> Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not Py_UNICODE_JOIN_SURROGATES).

All the other macros use PyUNICODE_*.

> I used the verb combine, taken from a comment in unicodeobject.c. combine is maybe better than join?

I like join, it's clear enough and shorter.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

Ah yes, the correct prefix for functions working on Py_UNICODE characters/strings is Py_UNICODE, not PyUNICODE, sorry.

>> For Python 2.7 and 3.2, I would prefer not to touch a public header, and so add the macros in unicodeobject.c.
>
> Is there some reason for this?

We don't add new features to stable releases.

>> If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, they will have to subtract 0x10000 themselves (whereas my macros require the ordinal to be preprocessed).
>
> If they turn out to be useful and we find a clearer name we can even make them public in 3.3, but we'll have to see about that.

I don't think that they are useful outside unicodeobject.c.

>> Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.
>
> If they don't, it won't be possible to fix #9200 in those branches

I don't think that #9200 is a bug, but more a feature request.

> Not sure if it's better to check if it's a BMP char, or if it's not.

I prefer a shorter name and avoiding double negation: !Py_UNICODE_IS_NON_BMP(ch).

> What is the naming convention for private macros in the same .c file where they are used?

Hopefully, there is no convention for private macros :-)

> Shouldn't they get at least a trailing _?

Nope.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

> Ezio used two different naming schemes in his email. Please always use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).

Indeed, that was a typo + copy/paste. I meant to say Py_UNICODE_* and _Py_UNICODE_*. Sorry about the confusion.

> Why would you want to touch Python 2.7 at all?
> [...]
> Certainly not into Python 2.7. Adding macros in patch level releases is also not such a good idea.

Because it has the bug and we can fix it (the macros will be private so that we don't add any feature). Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)?

My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them public in 3.3 so that new features are exposed only in 3.3.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
>> Ezio used two different naming schemes in his email. Please always use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).
>
> Indeed, that was a typo + copy/paste. I meant to say Py_UNICODE_* and _Py_UNICODE_*. Sorry about the confusion.

Good :-)

>> Why would you want to touch Python 2.7 at all?
>> [...]
>> Certainly not into Python 2.7. Adding macros in patch level releases is also not such a good idea.
>
> Because it has the bug and we can fix it (the macros will be private so that we don't add any feature). Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)?
>
> My idea is to fix the bug in 2.7/3.2/3.3 using the macros, but only make them public in 3.3 so that new features are exposed only in 3.3.

For bug fixes, you can put the macros straight into unicodeobject.c, but please leave unicodeobject.h untouched - otherwise people will mess around with these macros (even if they are private) and users will start to wonder about linker errors if they use old patch level releases of Python 2.7/3.2.

Also note that some of these macros change the behavior of Python - that's good if it fixes a bug (obviously :-)), but bad if it changes areas that are correctly implemented and then suddenly expose new behavior.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

> For bug fixes, you can put the macros straight into unicodeobject.c, but please leave unicodeobject.h untouched - otherwise people will mess around with these macros (even if they are private) and users will start to wonder about linker errors if they use old patch level releases of Python 2.7/3.2.

OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.
Regarding the name, other macros in unicodeobject.c don't have any prefix, so we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

> Also note that some of these macros change the behavior of Python - that's good if it fixes a bug (obviously :-)), but bad if it changes areas that are correctly implemented and then suddenly expose new behavior.

After this we can fix #9200 and make narrow builds behave correctly (i.e. like wide ones) with non-BMP chars (at least in some places).
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Ezio Melotti wrote:
>> For bug fixes, you can put the macros straight into unicodeobject.c, but please leave unicodeobject.h untouched - otherwise people will mess around with these macros (even if they are private) and users will start to wonder about linker errors if they use old patch level releases of Python 2.7/3.2.
>
> OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.
> Regarding the name, other macros in unicodeobject.c don't have any prefix, so we can do the same (e.g. IS_SURROGATE) for 2.7/3.2 if that's fine.

Sure.

>> Also note that some of these macros change the behavior of Python - that's good if it fixes a bug (obviously :-)), but bad if it changes areas that are correctly implemented and then suddenly expose new behavior.
>
> After this we can fix #9200 and make narrow builds behave correctly (i.e. like wide ones) with non-BMP chars (at least in some places).

Ok.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric V. Smith e...@trueblade.com added the comment:

On 8/17/2011 6:30 AM, Ezio Melotti wrote:
> OK, so in 2.7/3.2 I'll put them in unicodeobject.c, and in 3.3 I'll move them in unicodeobject.c.

I believe the second file should be unicodeobject.h, correct?
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

Correct.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Martin v. Löwis mar...@v.loewis.de added the comment:

> Also what about 3.2? Are you saying that we should fix the bug in 3.2/3.3 only and leave 2.x alone or that you don't want the bug to be fixed in all the bug-fix releases (i.e. 2.7/3.2)?

Notice that the macros themselves don't fix any bugs. As for the bugs you apparently want to fix using these macros: they should be considered on a case-by-case basis. Some of your planned bug fixes may introduce incompatibilities that rule out fixing them.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

> OK, so in 2.7/3.2 I'll put them in unicodeobject.c

It looks like #9200 only needs Py_UNICODE_NEXT, which can be implemented without the other Py_UNICODE_*SURROGATE* macros.
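To illustrate the claim that a NEXT-style helper does not need the other surrogate macros, here is a minimal sketch with the range checks written inline. It is an assumption-laden illustration, not the _Py_UNICODE_NEXT macro from the patch: `sketch_next` is a hypothetical function operating on a plain `uint16_t` buffer standing in for a narrow-build Py_UNICODE array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a NEXT-style helper on a narrow (UTF-16) buffer.
   Reads one code point starting at *pos and advances *pos past it. */
static uint32_t sketch_next(const uint16_t *buf, size_t len, size_t *pos)
{
    assert(*pos < len);
    uint32_t ch = buf[(*pos)++];
    /* A high surrogate followed by a low surrogate forms one code point. */
    if (ch >= 0xD800 && ch <= 0xDBFF && *pos < len) {
        uint32_t low = buf[*pos];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            (*pos)++;
            return (((ch & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
        }
    }
    /* BMP code unit, or an unpaired surrogate passed through unchanged. */
    return ch;
}
```

Passing unpaired surrogates through unchanged is a design choice of this sketch; the thread does not say how the real macro treats them.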
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
> A PEP 393 draft implementation is available at https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, this issue will be outdated: there won't be narrow builds of Python anymore (nor will there be wide builds).

Even if PEP 393 should go into Py4k one day (I don't believe that such major changes can be done in a minor release), we will still have to deal with surrogates in codecs, which is where these macros will get used, so I don't see how PEP 393 relates to the idea of adding helper macros to simplify the code.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

I think the 4 macros:

    #define _Py_UNICODE_ISSURROGATE
    #define _Py_UNICODE_ISHIGHSURROGATE
    #define _Py_UNICODE_ISLOWSURROGATE
    #define _Py_UNICODE_JOIN_SURROGATES

are quite straightforward and can avoid using the trailing _.

Since I would like to see #9200 fixed on 3.2 (and possibly 2.7 too), would it be ok to:
  1) commit the patch with the trailing _ for all the macros on 3.2(/2.7);
  2) commit the patch with the trailing _ only for the _NEXT macros in 3.3;
  3) fix #9200 on all these branches using the new macros (with or without _);
  4) remove the trailing _ from the _NEXT macros in 3.4 if it turns out to work well;

> we will still have to deal with surrogates in codecs, which is where these macros will get used

They will also be used in many str methods and afaiu PEP 393 should address that. I'm not sure it addresses codecs and builtin functions like chr() and ord() too.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Antoine Pitrou pit...@free.fr added the comment:

> I think the 4 macros:
>     #define _Py_UNICODE_ISSURROGATE
>     #define _Py_UNICODE_ISHIGHSURROGATE
>     #define _Py_UNICODE_ISLOWSURROGATE
>     #define _Py_UNICODE_JOIN_SURROGATES
> are quite straightforward and can avoid using the trailing _.

I don't want to bikeshed, but can we have proper consistent word separation? _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE (etc.)

>> we will still have to deal with surrogates in codecs, which is where these macros will get used
>
> They will also be used in many str methods and afaiu PEP 393 should address that. I'm not sure it addresses codecs and builtin functions like chr() and ord() too.

AFAIU, PEP 393 avoids producing surrogate pairs in the canonical internal representation (that's one of its selling points). Only the UTF-16 codecs would need to deal with surrogate pairs, in the encoded form.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them more readable though.

[0]: Include/unicodeobject.h:328
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Tom Christiansen tchr...@perl.com added the comment:

Ezio Melotti ezio.melo...@gmail.com added the comment:
> I think the 4 macros:
>     #define _Py_UNICODE_ISSURROGATE
>     #define _Py_UNICODE_ISHIGHSURROGATE
>     #define _Py_UNICODE_ISLOWSURROGATE
>     #define _Py_UNICODE_JOIN_SURROGATES
> are quite straightforward and can avoid using the trailing _.

For what it's worth, I've seen Unicode documentation that prefers the terms "lead surrogate" and "trail surrogate" as being clearer than the terms "high surrogate" and "low surrogate".

For example, from the Unicode BOM FAQ at http://unicode.org/faq/utf_bom.html

    Q: What are surrogates?

    A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.

BTW, considering recent discussions, you might want to read:

    Q: Are there any 16-bit values that are invalid?

    A: The two values FFFE₁₆ and FFFF₁₆ as well as the 32 values from FDD0₁₆ to FDEF₁₆ represent noncharacters. They are invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as well, i.e. any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆. [AF]

and also the answer to:

    Q: Are there any paired surrogates that are invalid?

whose answer I here omit for brevity, as it is a table.

I suspect that you guys are now increasingly sold on the answer to the next FAQ right after that one, now. :)

    Q: Because supplementary characters are uncommon, does that mean I can ignore them?

    A: Just because supplementary characters (expressed with surrogate pairs in UTF-16) are uncommon does not mean that they should be neglected. They include:

        * emoji symbols and emoticons, for interoperating with Japanese mobile phones
        * uncommon (but not unused) CJK characters, important for personal and place names
        * variation selectors for ideographic variation sequences
        * important symbols for mathematics
        * numerous minority scripts and historic scripts, important for some user communities

Another example of using "lead" and "trail" surrogates is in the first sentence from http://icu-project.org/apiref/icu4j/com/ibm/icu/text/UTF16.html

    * Naming: For clarity, High and Low surrogates are called Lead and Trail in the API, which gives a better sense of their ordering in a string. offset16 and offset32 are used to distinguish offsets to UTF-16 boundaries vs offsets to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as opposed to char16, which is a UTF-16 code unit.

    * Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a UTF-16 offset and back. Because of the difference in structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if bounds(string, offset16) != TRAIL.

    * Exceptions: The error checking will throw an exception if indices are out of bounds. Other than than that, all methods will behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32 values are present. UCharacter.isLegal() can be used to check for validity if desired.

    * Unmatched Surrogates: If the string contains unmatched surrogates, then these are counted as one UTF-32 value. This matches their iteration behavior, which is vital. It also matches common display practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).

    * Optimization: The method implementations may need optimization if the compiler doesn't fold static final methods. Since surrogate pairs will form an exceeding small percentage of all the text in the world, the singleton case should always be optimized for.

You can also see this reflected in the utf.h file from the ICU project as part of their C API in ICU4C:

    #define U_SENTINEL (-1)
        This value is intended for sentinel values for APIs that (take or) return single code points (UChar32).

    #define U_IS_UNICODE_NONCHAR(c)
        Is this code point a Unicode noncharacter?

    #define U_IS_UNICODE_CHAR(c)
        Is c a Unicode code point value (0..U+10ffff) that can be assigned a character?

    #define U_IS_BMP(c) ((uint32_t)(c)<=0xffff)
        Is this code point a BMP code point (U+0000..U+ffff)?

    #define U_IS_SUPPLEMENTARY(c) ((uint32_t)((c)-0x10000)<=0xfffff)
        Is this code point a supplementary code point (U+10000..U+10ffff)?

    #define U_IS_LEAD(c)
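The FAQ's rule that unpaired surrogates are invalid (a lead not followed by a trail, or a trail not preceded by a lead) can be checked with a short scan. The following is a minimal sketch using the lead/trail naming; the function names are hypothetical and are neither ICU's nor CPython's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers using the lead/trail naming from the quoted docs. */
static bool sketch_is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
static bool sketch_is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* True if every lead surrogate is immediately followed by a trail
   surrogate and no trail surrogate stands alone. */
static bool sketch_utf16_well_formed(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (sketch_is_lead(s[i])) {
            if (i + 1 >= len || !sketch_is_trail(s[i + 1]))
                return false;   /* lead without a following trail */
            i++;                /* skip the trail unit of the pair */
        } else if (sketch_is_trail(s[i])) {
            return false;       /* trail without a preceding lead */
        }
    }
    return true;
}
```

The sketch checks pairing only; it does not reject noncharacters such as FFFE₁₆, which the FAQ treats as a separate question.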
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Tom Christiansen tchr...@perl.com added the comment:

I now see there are lots of good things in the BOM FAQ that have come up lately regarding surrogates and other illegal characters, and about what can go in data streams. I quote a few of these from http://unicode.org/faq/utf_bom.html below:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an *unpaired* surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter *must* treat this as an error.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?
A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature: an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of #! at the beginning of Unix shell scripts.

Q: What should I do with U+FEFF in the middle of a file?
A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.

Q: Why wouldn’t I always use a protocol that requires a BOM?
A: Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor permitted*. Any U+FEFF would be interpreted as a ZWNBSP. Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).

Somewhat frustratingly, I am now almost more confused than ever by the last two sentences here:

Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

Each UTF is reversible, thus every UTF supports *lossless round tripping*: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping *must also* map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 *noncharacters* (including FFFE and FFFF), as well as unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at the top are quite clear that it is illegal to have unpaired surrogates in a UTF stream. I don’t understand therefore what it is saying about “must also” mapping all code points that aren’t valid Unicode characters to “unique byte sequences” to ensure
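As a small illustration of the FAQ answers quoted above (added here, not part of the FAQ or of Tom's message), this Python 3 sketch checks how the strict codecs treat an unpaired surrogate and a leading UTF-8 BOM. It assumes a reasonably recent CPython (3.4 or later, where the UTF-32 encoder also rejects lone surrogates); the helper name encodes_strictly is made up for the example.

```python
import codecs

lone = "\ud800"  # an unpaired high (lead) surrogate


def encodes_strictly(text, encoding):
    """Return True if `text` encodes under the default 'strict' error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The FAQ says a conformant converter must treat an unpaired surrogate as an error.
utf8_ok = encodes_strictly(lone, "utf-8")
utf32_ok = encodes_strictly(lone, "utf-32")

# 'utf-8-sig' strips one leading BOM; plain 'utf-8' keeps it as U+FEFF in the text.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")
with_sig = data.decode("utf-8-sig")
without_sig = data.decode("utf-8")
```

The second half mirrors the FAQ's point that a UTF-8 BOM is only a signature: whether it survives decoding depends on which codec the recipient chose, not on byte order.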
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Tom Christiansen tchr...@perl.com added the comment:

Antoine Pitrou rep...@bugs.python.org wrote on Tue, 16 Aug 2011 09:18:46 -0000:

>> I think the 4 macros:
>>     #define _Py_UNICODE_ISSURROGATE
>>     #define _Py_UNICODE_ISHIGHSURROGATE
>>     #define _Py_UNICODE_ISLOWSURROGATE
>>     #define _Py_UNICODE_JOIN_SURROGATES
>> are quite straightforward and can avoid using the leading _.

> I don't want to bikeshed, but can we have proper consistent word separation? _Py_UNICODE_IS_HIGH_SURROGATE, not _Py_UNICODE_ISHIGHSURROGATE (etc.)

Oh good, I thought it was only me whohadtroublereadingthose. :)

--tom
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Tom Christiansen tchr...@perl.com added the comment: Ezio Melotti rep...@bugs.python.org wrote on Tue, 16 Aug 2011 09:23:50 -: All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them more readable though. [0]: Include/unicodeobject.h:328 I am guessing that that is not quite why those don't have underscores in them. I bet it is actually something else. Watch: % unigrep '^\s*#\s*define\s+Py_[\p{Lu}_]+\b' unicodeobject.h #define Py_UNICODEOBJECT_H #define Py_USING_UNICODE #define Py_UNICODE_WIDE #define Py_UNICODE_ISSPACE(ch) \ #define Py_UNICODE_ISLOWER(ch) _PyUnicode_IsLowercase(ch) #define Py_UNICODE_ISUPPER(ch) _PyUnicode_IsUppercase(ch) #define Py_UNICODE_ISTITLE(ch) _PyUnicode_IsTitlecase(ch) #define Py_UNICODE_ISLINEBREAK(ch) _PyUnicode_IsLinebreak(ch) #define Py_UNICODE_TOLOWER(ch) _PyUnicode_ToLowercase(ch) #define Py_UNICODE_TOUPPER(ch) _PyUnicode_ToUppercase(ch) #define Py_UNICODE_TOTITLE(ch) _PyUnicode_ToTitlecase(ch) #define Py_UNICODE_ISDECIMAL(ch) _PyUnicode_IsDecimalDigit(ch) #define Py_UNICODE_ISDIGIT(ch) _PyUnicode_IsDigit(ch) #define Py_UNICODE_ISNUMERIC(ch) _PyUnicode_IsNumeric(ch) #define Py_UNICODE_ISPRINTABLE(ch) _PyUnicode_IsPrintable(ch) #define Py_UNICODE_TODECIMAL(ch) _PyUnicode_ToDecimalDigit(ch) #define Py_UNICODE_TODIGIT(ch) _PyUnicode_ToDigit(ch) #define Py_UNICODE_TONUMERIC(ch) _PyUnicode_ToNumeric(ch) #define Py_UNICODE_ISALPHA(ch) _PyUnicode_IsAlpha(ch) #define Py_UNICODE_ISALNUM(ch) \ #define Py_UNICODE_COPY(target, source, length) \ #define Py_UNICODE_FILL(target, value, length) \ #define Py_UNICODE_MATCH(string, offset, substring) \ #define Py_UNICODE_REPLACEMENT_CHARACTER ((Py_UNICODE) 0xFFFD) It looks like what is actually happening there is that you started out with names of the normal ctype(3) macroish thingies: isalpha isupper islower isdigit isxdigit isalnum isspace ispunct isprint isgraph iscntrl isblank isascii toupper isblank 
isascii toupper tolower toascii and wanted to preserve those, which would lead to Py_UNICODE_TOLOWER and Py_UNICODE_TOUPPER, since there are no functions in the original C versions those seem to mirror. Then when you wanted more of that ilk, you sensibly kept to the same naming convention. I eyeball few exceptions to that style here: % perl -nle '/^\s*#\s*define\s+(Py_[\p{Lu}_]+)\b/ and print $1' Include/*.h | sort -dfu | fmt -150 Py_ABSTRACTOBJECT_H Py_ALIGNED Py_ALLOW_RECURSION Py_ARITHMETIC_RIGHT_SHIFT Py_ASDL_H Py_AST_H Py_ATOMIC_H Py_BEGIN_ALLOW_THREADS Py_BITSET_H Py_BLOCK_THREADS Py_BLTINMODULE_H Py_BOOLOBJECT_H Py_BYTEARRAYOBJECT_H Py_BYTES_CTYPE_H Py_BYTESOBJECT_H Py_CAPSULE_H Py_CELLOBJECT_H Py_CEVAL_H Py_CHARMASK Py_CLASSOBJECT_H Py_CLEANUP_SUPPORTED Py_CLEAR Py_CODECREGISTRY_H Py_CODE_H Py_COMPILE_H Py_COMPLEXOBJECT_H Py_CURSES_H Py_DECREF Py_DEPRECATED Py_DESCROBJECT_H Py_DICTOBJECT_H Py_DTSF_ALT Py_DTSF_SIGN Py_DTST_FINITE Py_DTST_INFINITE Py_DTST_NAN Py_END_ALLOW_RECURSION Py_END_ALLOW_THREADS Py_ENUMOBJECT_H Py_EQ Py_ERRCODE_H Py_ERRORS_H Py_EVAL_H Py_FILEOBJECT_H Py_FILEUTILS_H Py_FLOATOBJECT_H Py_FORCE_DOUBLE Py_FORCE_EXPANSION Py_FORMAT_PARSETUPLE Py_FRAMEOBJECT_H Py_FUNCOBJECT_H Py_GCC_ATTRIBUTE Py_GE Py_GENOBJECT_H Py_GETENV Py_GRAMMAR_H Py_GT Py_HUGE_VAL Py_IMPORT_H Py_INCREF Py_INTRCHECK_H Py_INVALID_SIZE Py_ISALNUM Py_ISALPHA Py_ISDIGIT Py_IS_FINITE Py_IS_INFINITY Py_ISLOWER Py_IS_NAN Py_ISSPACE Py_ISUPPER Py_ISXDIGIT Py_ITEROBJECT_H Py_LE Py_LISTOBJECT_H Py_LL Py_LOCAL Py_LOCAL_INLINE Py_LONGINTREPR_H Py_LONGOBJECT_H Py_LT Py_MARSHAL_H Py_MARSHAL_VERSION Py_MATH_E Py_MATH_PI Py_MEMCPY Py_MEMORYOBJECT_H Py_METAGRAMMAR_H Py_METHODOBJECT_H Py_MODSUPPORT_H Py_MODULEOBJECT_H Py_NAN Py_NE Py_NODE_H Py_OBJECT_H Py_OBJIMPL_H Py_OPCODE_H Py_OSDEFS_H Py_OVERFLOWED Py_PARSETOK_H Py_PGEN_H Py_PGENHEADERS_H Py_PRINT_RAW Py_PYARENA_H Py_PYDEBUG_H Py_PYFPE_H Py_PYGETOPT_H Py_PYMATH_H Py_PYMEM_H Py_PYPORT_H Py_PYSTATE_H Py_PYTHON_H Py_PYTHONRUN_H 
Py_PYTHREAD_H Py_PYTIME_H Py_RANGEOBJECT_H Py_REFCNT Py_REF_DEBUG Py_RETURN_FALSE Py_RETURN_INF Py_RETURN_NAN Py_RETURN_NONE Py_RETURN_TRUE Py_SAFE_DOWNCAST Py_SET_ERANGE_IF_OVERFLOW Py_SET_ERRNO_ON_MATH_ERROR Py_SETOBJECT_H Py_SIZE Py_SLICEOBJECT_H Py_STRCMP_H Py_STRTOD_H Py_STRUCTMEMBER_H Py_STRUCTSEQ_H Py_SYMTABLE_H Py_SYSMODULE_H Py_TOKEN_H Py_TOLOWER Py_TOUPPER Py_TPFLAGS_BASE_EXC_SUBCLASS Py_TPFLAGS_BASETYPE Py_TPFLAGS_BYTES_SUBCLASS Py_TPFLAGS_DEFAULT Py_TPFLAGS_DICT_SUBCLASS Py_TPFLAGS_HAVE_GC Py_TPFLAGS_HAVE_STACKLESS_EXTENSION Py_TPFLAGS_HAVE_VERSION_TAG Py_TPFLAGS_HEAPTYPE Py_TPFLAGS_INT_SUBCLASS Py_TPFLAGS_IS_ABSTRACT Py_TPFLAGS_LIST_SUBCLASS Py_TPFLAGS_LONG_SUBCLASS Py_TPFLAGS_READY Py_TPFLAGS_READYING
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Tom Christiansen wrote:
> So keeping your preamble bits, I might have considered doing it this way if it were me doing it:
>
>     #define _Py_UNICODE_IS_SURROGATE
>     #define _Py_UNICODE_IS_LEAD_SURROGATE
>     #define _Py_UNICODE_IS_TRAIL_SURROGATE
>     #define _Py_UNICODE_JOIN_SURROGATES
>
> But I also come from a culture that uses more underscores than you guys tend to, as shown in some of the macro names shown below from utf8.h file.

I find that most projects use more underscores in uppercase names than Python does. :)

The reasoning behind e.g. ISSURROGATE is that those names originate from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE macros, which in turn stem from the C APIs of the same names (see unicodeobject.h for reference).

Regarding low/high vs. lead/trail: The Unicode database uses the terms low/high and we do in Python as well, so let's stick with those.

What I don't understand is why those macros should be declared private to Python (with the leading underscore). They are quite useful for extensions implementing codecs or other transformations as well.

BTW: I think the other issues mentioned in the discussion are more important to get right than the names of those macros.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Tom Christiansen tchr...@perl.com added the comment:

Marc-Andre Lemburg rep...@bugs.python.org wrote on Tue, 16 Aug 2011 12:11:22 -0000:

> The reasoning behind e.g. ISSURROGATE is that those names originate from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE macros which in return stem from the C APIs of the same names (see unicodeobject.h for reference).

I eventually figured that part out in the larger context. Makes sense looked at that way.

> Regarding low/high vs. lead/trail: The Unicode database uses the terms low/high and we do in Python as well, so let's stick with those.

Yes, those are their block assignments, Block=High_Surrogates and Block=Low_Surrogates. I just thought I should mention that in the time since those were invented (which cannot be changed), after using them in real code for some years, their lingo seems to have evolved away from those initial names and toward lead/trail as less confusing.

> What I don't understand is why those macros should be declared private to Python (with the leading underscore). They are quite useful for extensions implementing codecs or other transformations as well.

I was wondering about that myself.
Beyond there being a lot fewer of those private macros in the Python *.h files, they also seem to be of rather different character than the iswhatever() macros: $ perl -nle '/^\s*#\s*define\s+(_Py_[\p{Lu}_]+)\b/ and print $1' *.h | sort -dfu | fmt -160 _Py_ANNOTATE_BARRIER_DESTROY _Py_ANNOTATE_BARRIER_INIT _Py_ANNOTATE_BARRIER_WAIT_AFTER _Py_ANNOTATE_BARRIER_WAIT_BEFORE _Py_ANNOTATE_BENIGN_RACE _Py_ANNOTATE_BENIGN_RACE_SIZED _Py_ANNOTATE_BENIGN_RACE_STATIC _Py_ANNOTATE_CONDVAR_LOCK_WAIT _Py_ANNOTATE_CONDVAR_SIGNAL _Py_ANNOTATE_CONDVAR_SIGNAL_ALL _Py_ANNOTATE_CONDVAR_WAIT _Py_ANNOTATE_ENABLE_RACE_DETECTION _Py_ANNOTATE_EXPECT_RACE _Py_ANNOTATE_FLUSH_STATE _Py_ANNOTATE_HAPPENS_AFTER _Py_ANNOTATE_HAPPENS_BEFORE _Py_ANNOTATE_IGNORE_READS_AND_WRITES_BEGIN _Py_ANNOTATE_IGNORE_READS_AND_WRITES_END _Py_ANNOTATE_IGNORE_READS_BEGIN _Py_ANNOTATE_IGNORE_READS_END _Py_ANNOTATE_IGNORE_SYNC_BEGIN _Py_ANNOTATE_IGNORE_SYNC_END _Py_ANNOTATE_IGNORE_WRITES_BEGIN _Py_ANNOTATE_IGNORE_WRITES_END _Py_ANNOTATE_MUTEX_IS_USED_AS_CONDVAR _Py_ANNOTATE_NEW_MEMORY _Py_ANNOTATE_NO_OP _Py_ANNOTATE_PCQ_CREATE _Py_ANNOTATE_PCQ_DESTROY _Py_ANNOTATE_PCQ_GET _Py_ANNOTATE_PCQ_PUT _Py_ANNOTATE_PUBLISH_MEMORY_RANGE _Py_ANNOTATE_PURE_HAPPENS_BEFORE_MUTEX _Py_ANNOTATE_RWLOCK_ACQUIRED _Py_ANNOTATE_RWLOCK_CREATE _Py_ANNOTATE_RWLOCK_DESTROY _Py_ANNOTATE_RWLOCK_RELEASED _Py_ANNOTATE_SWAP_MEMORY_RANGE _Py_ANNOTATE_THREAD_NAME _Py_ANNOTATE_TRACE_MEMORY _Py_ANNOTATE_UNPROTECTED_READ _Py_ANNOTATE_UNPUBLISH_MEMORY_RANGE _Py_AS_GC _Py_CHECK_REFCNT _Py_COUNT_ALLOCS_COMMA _Py_DEC_REFTOTAL _Py_DEC_TPFREES _Py_INC_REFTOTAL _Py_INC_TPALLOCS _Py_INC_TPFREES _Py_PARSE_PID _Py_REF_DEBUG_COMMA _Py_SET_EDOM_FOR_NAN BTW: I think the other issues mentioned in the discussion are more important to get right, than the names of those macros. Yup. Just paint it red. 
:) --tom
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

I'm reposting my patch from #12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.

I don't want to add public macros because with the stable API and with PEP 393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

Copy/paste of the initial message of my issue #12751 (msg142108):

---
A lot of code is duplicated in unicodeobject.c to manipulate (encode/decode) surrogates. Each function has from one to three different implementations. The new decode_ucs4() function adds a new implementation. Attached patch replaces this code by macros.

I think that only the implementations of IS_HIGH_SURROGATE and IS_LOW_SURROGATE are important for speed. ((ch & 0xFC00UL) == 0xD800) (from decode_ucs4) is *a little bit* faster than (0xD800 <= ch && ch <= 0xDBFF) on my CPU (Atom Z520 @ 1.3 GHz): running test_unicode 4 times takes ~54 sec instead of ~57 sec (-3%).

These 3 macros have to be checked, I wrote the first one:

    #define IS_SURROGATE(ch) (((ch) & 0xF800UL) == 0xD800)
    #define IS_HIGH_SURROGATE(ch) (((ch) & 0xFC00UL) == 0xD800)
    #define IS_LOW_SURROGATE(ch) (((ch) & 0xFC00UL) == 0xDC00)

I added a cast to Py_UCS4 in COMBINE_SURROGATES to avoid integer overflow if Py_UNICODE is 16 bits (narrow build). It's maybe useless.

    #define COMBINE_SURROGATES(ch1, ch2) \
        (((((Py_UCS4)(ch1) & 0x3FF) << 10) | ((Py_UCS4)(ch2) & 0x3FF)) + 0x10000)

HIGH_SURROGATE and LOW_SURROGATE require that their ordinal argument has been preprocessed to fit in [0; 0x]. I added this requirement in the comment of these macros. It would be better to have only one macro to do the two operations, but because *p++ (dereference and increment) is usually used, I prefer to avoid one unique macro (I don't like passing *p++ to a macro that uses its argument more than once). Or we may add a third macro using HIGH_SURROGATE and LOW_SURROGATE.

I rewrote the main loop of PyUnicode_EncodeUTF16() to avoid a useless test on ch2 on narrow builds.

I also added an IS_NONBMP macro just because I prefer macros over hardcoded constants.
---

-- Added file: http://bugs.python.org/file22915/unicode_macros.patch
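To make the bit tests above concrete, here is a minimal Python sketch (an illustration, not code from unicode_macros.patch). It assumes the inputs are 16-bit code units, since a mask such as 0xFC00 ignores any bits above bit 15 and would misclassify larger values; the function names simply mirror the macro names, and split_surrogates is a hypothetical inverse added only to show the round trip.

```python
def is_surrogate(ch):
    """Mask test for any code unit in U+D800..U+DFFF (top 5 bits are 11011)."""
    return (ch & 0xF800) == 0xD800


def is_high_surrogate(ch):
    """Mask test for U+D800..U+DBFF (top 6 bits are 110110)."""
    return (ch & 0xFC00) == 0xD800


def is_low_surrogate(ch):
    """Mask test for U+DC00..U+DFFF (top 6 bits are 110111)."""
    return (ch & 0xFC00) == 0xDC00


def combine_surrogates(high, low):
    """Join a high/low pair into one code point above U+FFFF."""
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000


def split_surrogates(code_point):
    """Hypothetical inverse: split a non-BMP code point into (high, low)."""
    offset = code_point - 0x10000
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)


# For every 16-bit value, the mask tests agree with the range comparisons.
masks_match_ranges = all(
    is_surrogate(ch) == (0xD800 <= ch <= 0xDFFF)
    and is_high_surrogate(ch) == (0xD800 <= ch <= 0xDBFF)
    and is_low_surrogate(ch) == (0xDC00 <= ch <= 0xDFFF)
    for ch in range(0x10000)
)
```

The exhaustive check only covers 16-bit inputs, which is the caveat worth keeping in mind when the same test is applied to 32-bit values.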
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
>
> I'm reposting my patch from #12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.
>
> I don't want to add public macros because with the stable API and with PEP 393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.

PEP 393 is an optional feature for extension writers. If they don't need PEP 393 style stable ABIs and want to use the macros, they should be able to. I'm therefore -1 on making them private.

Regarding separating adding the various surrogate macros and the next-macros: I don't see a problem with adding both in Python 3.3.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Marc-Andre Lemburg wrote:
> STINNER Victor wrote:
>> I'm reposting my patch from #12751. I think that it's simpler than belopolsky's patch: it doesn't add public macros in unicodeobject.h and doesn't add the complex Py_UNICODE_NEXT() macro. My patch only adds private macros in unicodeobject.c to factorize the code.
>>
>> I don't want to add public macros because with the stable API and with PEP 393, we are trying to hide the Py_UNICODE type and PyUnicodeObject internals. In belopolsky's patch, only Py_UNICODE_NEXT() is used outside unicodeobject.c.
>
> PEP 393 is an optional feature for extension writers. If they don't need PEP 393 style stable ABIs and want to use the macros, they should be able to. I'm therefore -1 on making them private.

Sorry, I mean PEP 384, not PEP 393.

Whether PEP 393 will turn out to be a workable solution has yet to be seen, but that's a different subject. In any case, Py_UNICODE and access macros for PyUnicodeObject are in widespread use, so trying to hide them won't work until we reach Py4k.

> Regarding separating adding the various surrogate macros and the next-macros: I don't see a problem with adding both in Python 3.3.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment: My patch version 2: don't test for a specific major version of an OS, test only its name. My patch now changes also tests for FreeBSD, NetBSD, OpenBSD, (...), and the _expectations list in regrtest.py. -- Added file: http://bugs.python.org/file22916/linux3-v2.patch
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Changes by STINNER Victor victor.stin...@haypocalc.com: Removed file: http://bugs.python.org/file22916/linux3-v2.patch
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Changes by STINNER Victor victor.stin...@haypocalc.com: -- Removed message: http://bugs.python.org/msg142225
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment: (oops, msg142225 was for issue #12326)
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: The code review links point to something weird. Victor, can you upload your patch for review? My first impression is that your patch does not accomplish much beyond replacing some literal expressions with macros. What I wanted to achieve with this issue was to enable writing code without #ifdef Py_UNICODE_WIDE branches. In your patch these branches seem to still be there and in fact it appears that new code is longer than the old one (I am not sure why, but I see more '+' than '-'s in your patch.)
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

> The code review links point to something weird.

That's because I posted a patch for another issue. It's the patch set 5, not the patch set 6 :-) Direct link: http://bugs.python.org/review/10542/patch/3174/9874

> My first impression is that your patch does not accomplish much beyond replacing some literal expressions with macros.

Yes, and it avoids the duplication of some code patterns, as explained in my message. I would like to avoid constants in the code. Some macros are *a little bit* faster than the current code.

> What I wanted to achieve with this issue was to enable writing code without #ifdef Py_UNICODE_WIDE branches.

Yes, and I think that it's better to split this issue in two steps:
1- add macros for the surrogates (test, join, ...)
2- Py_UNICODE_NEXT()

> In your patch these branches seem to still be there and in fact it appears that new code is longer than the old one

Yes, the code adds more lines than it removes. Is it a problem? My goal is to have more readable code (easier to maintain).
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without the leading _ in 3.3 and with the leading _ in 2.7/3.2. They should go in unicodeobject.h and be public in 3.3+.

Regarding the name, it would be fine with me to use Py_UNICODE_IS_HIGH_SURROGATE. Other IS* macros don't use underscores between words, but JOIN_SURROGATES and other proposed macros (e.g. PUT_NEXT/WRITE_NEXT) do. Also these macros are not related to any existing API like e.g. isalpha. I think HIGH/LOW are fine, we can mention lead/trail in the doc.

Regarding the implementation, we could use Victor's one if it's faster and it has no other side effects.

Regarding the other macros:
* _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in. They can be private in all the 3 branches and made public in 3.4 if they work well;
* IS_NONBMP doesn't simplify the code much but makes it more readable. ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;
* I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT. If they are they should get a better name because the current one is not clear about what they do.

Unless someone disagrees I'll prepare a patch with Py_UNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary with Victor's implementation, and commit it (after a review). We can think about the rest later.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment: See also #12751. -- nosy: +tchrist
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Martin v. Löwis mar...@v.loewis.de added the comment: A PEP 393 draft implementation is available at https://bitbucket.org/t0rsten/pep-393/ (branch pep-393); if this gets into 3.3, this issue will be outdated: there won't be narrow builds of Python anymore (nor will there be wide builds).
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment: That's really good news. Some Unicode issues can still be fixed on 2.7 and 3.2 though. FWIW I was planning to look at this and #9200 in the following days and see if I can fix them.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Martin v. Löwis mar...@v.loewis.de added the comment:

> Actually, it looks like PEP 3131 and the Language Reference [1] still disagree. The latter says:
>
>     identifier ::= id_start id_continue*
>
> which should probably be
>
>     identifier ::= xid_start xid_continue*
>
> instead.

Interesting. XID_* is being used in the PEP since r57023, whereas the documentation was added in r57824. In any case, this is now fixed in r87575/r87576.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Georg Brandl ge...@python.org added the comment:

> I think the proposal is that fixing this minefield can wait until Python 3.3 (or even 3.4, or later).

That is what I was thinking. (Alex: You might not know that Martin was the main proponent of non-ASCII identifiers, so this assessment should have some weight.)

> I'm thinking about an approach of a variable representation: one, two, or four bytes, depending on the widest character that appears in the string. I think it can be arranged to make this mostly backwards-compatible with existing APIs, so it doesn't need to wait for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.

That is an interesting idea. I would be interested in helping out when you implement it.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> I am attaching a patch for commit review. I added an underscore prefix to all new macros. This way I am not introducing new features and we will have a full release cycle to come up with better names. I would just note that next terminology is consistent with PyDict_Next and _PySet_NextEntry. The latter suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

I don't think this should go into 3.2. The macros have the potential of subtly changing Python semantics when used in places that previously did not support auto-joining surrogates. Let's wait for 3.3 with the change.

Some comments:

* The macros still need some more attention to enhance their performance.
* For consistency, I'd choose names Py_UNICODE_READ_NEXT() and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and Py_UNICODE_PUT_NEXT().
* Py_UNICODE_JOIN_SURROGATES() either needs to go away completely (and be integrated straight into the other macros), or be renamed to Py_UCS4_JOIN_SURROGATES(), since it doesn't return Py_UNICODE values.
* The macros need to be carefully documented, both in unicodeobject.h and the general docs.
* Your _Py_UNICODE_PUT_NEXT() implementation is missing a few casts to turn ch into a Py_UNICODE/Py_UCS4 value.
* Same for your _Py_UNICODE_NEXT() to make sure that the return value is indeed a Py_UNICODE value.
* In general, we should probably be clear on the allowed input and define the output types in the documentation.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Georg Brandl ge...@python.org added the comment:

> Let's wait for 3.3 with the change.

Definitely.

-- nosy: +georg.brandl versions: +Python 3.3 -Python 3.2
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 10:00 AM, Georg Brandl rep...@bugs.python.org wrote:
..
>> Let's wait for 3.3 with the change.
>
> Definitely.

Does this also mean that the numerous surrogate-related bugs should wait until 3.3 as well? (See issues #9200 and #10521.) This patch was just a stepping stone for the bug fixes. I deliberately kept the code changes to the minimum sufficient to demonstrate and test the new macros. I would not mind restricting the patch further by limiting it to the header file changes so that the macros can be used to fix bugs. Fixing the bugs in the old verbose style does not seem feasible.

Note that surrogate bugs are not as exotic as they seem. For example, on a wide build I can do 42 but on a narrow build,

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<string>", line 1
        𝐀 = 42
        ^
    SyntaxError: invalid character in identifier

So at the moment, narrow and wide builds implement two different languages.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Georg Brandl ge...@python.org added the comment: That bug already strikes me as quite exotic. You need to at least address Marc-Andre's remarks, and to give an overview of what else you'd like to change as well, and how this could affect semantics. Remember that the next release is already a release candidate.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
..
> * The macros still need some more attention to enhance their performance.

Although I made your suggested change from '-' to '', I seriously doubt that this would make any difference on modern CPUs. Why do you think these macros are performance critical? Users with lots of supplementary characters in their files are probably better off with a wide build where Py_UNICODE_NEXT() is just *ptr++ and can hardly be further optimized. Higher performance algorithms are possible, but those should probably do some loop unrolling and/or process more than one character at a time. At this point, however, it is too soon to worry about optimization before we even know where these macros will be used.

> * For consistency, I'd choose names Py_UNICODE_READ_NEXT() and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and Py_UNICODE_PUT_NEXT().

I would leave it for you and Raymond to reach a consensus. My understanding is that Raymond does not want next in the name, so your suggestion still conflicts with that. I would mildly prefer GET/PUT over READ/WRITE because the latter suggests multiple characters.

As discussed before, the macro prefix does not imply the return value. Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any other Py_UNICODE_ macro. Note that I added a leading underscore to Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no immediate pressure to get the names perfect.

> * The macros need to be carefully documented, both in unicodeobject.h and the general docs.

I've added a description above the _Py_UNICODE_*NEXT macros. I would really like to see these macros in private use for a while before they are published for a general audience. I'll add a comment describing _Py_UNICODE_JOIN_SURROGATES. The remaining macros seem to be fairly self-explanatory (unlike, say, Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE, which are not documented in unicodeobject.h.)

Explicit downcasts would probably make sense, for example *(ptr)++ = (Py_UNICODE)ch instead of *(ptr)++ = ch, but I don't think we need explicit casts in, say, Py_UCS4 code = (ch) - 0x10000; where they can mask coding errors. I also looked for the use of casts elsewhere in unicodeobject.h and the following does not look right:

    #define Py_UNICODE_ISSPACE(ch) \
        ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))

It looks like this won't work right if ch is a signed char.

> * Same for your _Py_UNICODE_NEXT() to make sure that the return value is indeed a Py_UNICODE value.

The return value of _Py_UNICODE_NEXT() is *not* Py_UNICODE, it is Py_UCS4 and as far as I can see, every conditional branch in the narrow case has an explicit cast. In the wide case, I don't think we want an explicit cast because ptr should already be Py_UCS4* and if it is not, it may be a coding error that we don't want to mask.

> * In general, we should probably be clear on the allowed input and define the output types in the documentation.

I agree. I'll add a note that ptr and end should be Py_UNICODE*. I am not sure what we should say about the ch argument. If we add casts, the macro will accept anything, but we should probably document it as expecting Py_UCS4.
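The read/write pair discussed above can be sketched in Python as follows. This is only an illustration of what the thread describes for narrow builds, not the macros from the patch: a list of integers stands in for a Py_UNICODE buffer of 16-bit code units, the names next_code_point, put_code_point and decode_units are hypothetical, and passing an unpaired surrogate through unchanged is an assumption about the intended behaviour.

```python
def next_code_point(units, pos):
    """Read one code point from a list of 16-bit code units.

    Returns (code_point, new_pos). A high surrogate followed by a low
    surrogate is joined; an unpaired surrogate is returned as-is (assumed).
    """
    ch = units[pos]
    pos += 1
    if 0xD800 <= ch <= 0xDBFF and pos < len(units):
        low = units[pos]
        if 0xDC00 <= low <= 0xDFFF:
            ch = (((ch & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
            pos += 1
    return ch, pos


def put_code_point(units, ch):
    """Append code point `ch` to `units` as one or two 16-bit code units."""
    if ch > 0xFFFF:
        offset = ch - 0x10000
        units.append(0xD800 | (offset >> 10))
        units.append(0xDC00 | (offset & 0x3FF))
    else:
        units.append(ch)


def decode_units(units):
    """Collect every code point in `units` using next_code_point()."""
    out, pos = [], 0
    while pos < len(units):
        code_point, pos = next_code_point(units, pos)
        out.append(code_point)
    return out
```

On a wide build the same loop would reduce to reading one element per step, which is the point made above about Py_UNICODE_NEXT() being just *ptr++ there.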
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:24 PM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
..
> Perhaps we should allow ord() to work on surrogates in UCS4 builds as well. That would reduce the number of surprises.

This is an interesting idea; however, having surrogates in UCS4 builds will sooner or later lead to surprises such as

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

I thought UCS4 (or more properly, UTF-32) did not allow encoding of surrogate code points. It is somewhat bothersome that a valid string literal such as '\uD800\uDC00' in a narrow build is subtly invalid in a wide build. It would probably be better if '\uD800\uDC00' was either rejected on a wide build, or interpreted as a single character so that True on any build.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

The example in my previous message should have been:

    >>> '\U00010000' == '\uD800\uDC00'
    True
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 11:36 AM, Georg Brandl rep...@bugs.python.org wrote:
..
> That bug already strikes me as quite exotic.

Would it look as exotic if presented like this?

      File "<stdin>", line 1
        ̀ = 5
        ^
    SyntaxError: invalid character in identifier

(works on a wide build)

Note that with few exceptions, pretty much anything you can do with supplementary characters will produce different results in wide and narrow builds. This includes all character type methods (isalpha, isdigit, etc.), transformations such as case folding or normalization, text formatting, etc.

When I suggested on python-dev that supplementary character support on narrow builds is not worth violating fundamental invariants such as len(chr(i)) == 1, pretty much everyone said that Python should support full Unicode regardless of build. When it comes to fixing specific differences between builds, I hear that these differences are not important because no one is using supplementary characters.

This example is less exotic than, say, str.center() or str.swapcase(), not because it involves less exotic characters (all non-BMP characters are exotic by definition) but because it involves the core definition of the Python language.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I should stop using e-mail to reply to bug reports! The mangled example was

    >>> ̀ = 5
      File "<stdin>", line 1
        ̀ = 5
        ^
    SyntaxError: invalid character in identifier
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: Added file: http://bugs.python.org/file20190/issue10542a.diff
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment:

On Wednesday, 29 December 2010 at 19:26 +0000, Alexander Belopolsky wrote:
> Would it look as exotic if presented like this?
>
>   File "<stdin>", line 1
>     ̀ = 5
>     ^
> SyntaxError: invalid character in identifier
>
> (works on a wide build)

Using non-ASCII identifiers is exotic. Using non-BMP identifiers is crazy :-)

Seriously, it can wait until 3.3.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 3:36 PM, STINNER Victor rep...@bugs.python.org wrote:
..
> Using non-ASCII identifiers is exotic. Using non-BMP identifiers is crazy :-)

Hmm, we clearly disagree on what crosses the boundary of the mental norm. IMHO, it is crazy to require users to care which plane their characters come from or whether their programs will be run on a wide or a narrow build. I see nothing wrong with a desire to use characters from, say, the Mathematical Alphanumeric Symbols block if that makes some Python expressions look more like the mathematical formulas that they represent. However, it is not about any particular usage, but about the language definition. I don't remember even a suggestion during the PEP 3131 discussion that non-BMP characters should be excluded from identifiers wholesale.

In any case, can someone remind me what was the use case that motivated chr(i) returning a two-character string for i > 0xFFFF? I think we should either stop pretending that narrow builds can handle non-BMP characters and disallow them in Python strings, or we should try to fix the bugs associated with them.

> Seriously, it can wait until 3.3.

What exactly can wait until 3.3? The presented patch introduces no user-visible changes. It is only a stepping stone to restoring some sanity in the way supplementary characters are treated by narrow builds. At the moment, it is a minefield: you can easily produce surrogate pairs from string literals and codecs, but when you start using them, you have a 50% chance that things will blow up, a 40% chance of getting a wrong result and maybe a 10% chance that it will work.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Martin v. Löwis mar...@v.loewis.de added the comment:

>> Seriously, it can wait until 3.3.
>
> What exactly can wait until 3.3? The presented patch introduces no user-visible changes. It is only a stepping stone to restoring some sanity in the way supplementary characters are treated by narrow builds. At the moment, it is a minefield: you can easily produce surrogate pairs from string literals and codecs, but when you start using them, you have a 50% chance that things will blow up, a 40% chance of getting a wrong result and maybe a 10% chance that it will work.

I think the proposal is that fixing this minefield can wait until Python 3.3 (or even 3.4, or later). I plan to propose a complete redesign of the representation of Unicode strings, which may well make this entire set of changes obsolete.

As for the language definition: I think the definition is quite clear and unambiguous. It may be that Python 3.2 doesn't fully implement it.

IOW: relax.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 8:02 PM, Martin v. Löwis rep...@bugs.python.org wrote:
..
> I plan to propose a complete redesign of the representation of Unicode strings, which may well make this entire set of changes obsolete.

Are you serious? This sounds like a py4k idea. Can you give us a hint on what the new representation will be? Meanwhile, what is your recommendation for application developers? Should they attempt to fix the code that assumes len(chr(i)) == 1? Should text processing applications designed to run on a narrow build simply reject non-BMP text? Should application writers avoid using str.isxyz() methods?

> As for the language definition: I think the definition is quite clear and unambiguous. It may be that Python 3.2 doesn't fully implement it.

Given that until recently (r87433) the PEP and the reference manual disagreed on the definition, I have to ask what definition you refer to. What Python 3.2 (or rather 3.1) implements, however, is important because it has been declared to be *the* definition of the Python language regardless of what PEPs and docs have to say.

> IOW: relax.

This is the easy part. :-)
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Dec 29, 2010 at 9:38 PM, Alexander Belopolsky rep...@bugs.python.org wrote:
..
> Given that until recently (r87433) the PEP and the reference manual disagreed on the definition,

Actually, it looks like PEP 3131 and the Language Reference [1] still disagree. The latter says:

    identifier ::= id_start id_continue*

which should probably be

    identifier ::= xid_start xid_continue*

instead.

[1] http://docs.python.org/py3k/reference/lexical_analysis.html#identifiers
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Martin v. Löwis mar...@v.loewis.de added the comment:

> Are you serious? This sounds like a py4k idea. Can you give us a hint on what the new representation will be?

I'm thinking about an approach of a variable representation: one, two, or four bytes, depending on the widest character that appears in the string. I think it can be arranged to make this mostly backwards-compatible with existing APIs, so it doesn't need to wait for py4k, IMO. OTOH, I'm not sure I'll make it for 3.3.

> Meanwhile, what is your recommendation for application developers? Should they attempt to fix the code that assumes len(chr(i)) == 1? Should text processing applications designed to run on a narrow build simply reject non-BMP text? Should application writers avoid using str.isxyz() methods?

Given that this is vaporware: proceed as if that idea didn't exist.
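A minimal sketch of the "widest character decides the width" idea described above. The function name storage_width and the exact thresholds (0xFF and 0xFFFF) are assumptions made for illustration; the message itself only says "one, two, or four bytes" and does not describe an implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes per character needed to store
   every code point of s[0..n) in a fixed-width array.  The cut-offs
   0xFF and 0xFFFF are an assumption, not taken from the thread. */
static int storage_width(const uint32_t *s, size_t n)
{
    uint32_t widest = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(s[i] <= 0x10FFFF);   /* valid Unicode range only */
        if (s[i] > widest)
            widest = s[i];
    }
    if (widest <= 0xFF)
        return 1;
    if (widest <= 0xFFFF)
        return 2;
    return 4;
}
```

With such a scheme every string is a plain array of fixed-size units, so no surrogate handling would be needed when indexing or iterating.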
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
..
> * this version should be slightly faster and is also easier to read:
>
> #define Py_UCS4_READ_CODE_POINT(ptr, end) \
> ..
>     Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
> ..
> I haven't tested it, but you get the idea.

I don't think C guarantees the order of evaluation of the operands in bitwise expressions such as the expansion of the JOIN_SURROGATES macro.
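The evaluation-order point can be sketched as follows. When a macro argument such as (ptr)++ appears twice in one expression joined by |, the two increments are unsequenced relative to each other, so which unit ends up as "high" is not defined by the language. One way around this, shown here with hypothetical names (unit16, JOIN_SURROGATES, read_pair) that are not taken from the patch, is to read both units into locals before joining them.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t unit16;   /* stand-in for a narrow-build Py_UNICODE */

/* Join a high and a low surrogate; each argument is used exactly once. */
#define JOIN_SURROGATES(high, low) \
    (((((uint32_t)(high) & 0x03FF) << 10) | ((uint32_t)(low) & 0x03FF)) + 0x10000)

/* Read a surrogate pair at *pp and advance *pp by two units.  The two
   reads are separate statements, so their order is well defined. */
static uint32_t read_pair(const unit16 **pp)
{
    unit16 high = (*pp)[0];
    unit16 low = (*pp)[1];

    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    *pp += 2;
    return JOIN_SURROGATES(high, low);
}
```

A macro could achieve the same effect by indexing (ptr)[0] and (ptr)[1] and advancing the pointer in a separate comma-sequenced step.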
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

I am attaching a patch for commit review. I added an underscore prefix to all new macros. This way I am not introducing new features and we will have a full release cycle to come up with better names. I would just note that the "next" terminology is consistent with PyDict_Next and _PySet_NextEntry. The latter suggests that Py_UNICODE_NEXT_UCS4 may be a better choice.

--
assignee: lemburg -> belopolsky
stage: patch review -> commit review
Added file: http://bugs.python.org/file20186/issue10542.diff
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: -- nosy: +doerwalter
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Fri, Dec 10, 2010 at 6:09 PM, Daniel Stutzbach rep...@bugs.python.org wrote:
..
> The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters < 0x10000 into characters >= 0x10000 or vice versa.
>
> Can we prove that will always be the case, for current and future versions of Unicode, for all or almost all of the transformations we care about?

Certainly not for all, but for some important transformations, I believe the Unicode Standard does promise that the transformation maps BMP to BMP and supplements to supplements. Case folding and normalization are two important examples.

> Answering that question and figuring out what to do about it are probably more trouble than it's worth. If a particular point proves to be a bottleneck, we can always specialize the code there later.

Agree. It is even more likely that the applications that have to deal with lots of supplementary characters will be better off using a wide unicode build rather than such specialization.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Daniel Stutzbach stutzb...@google.com added the comment:

In bltinmodule.c, it looks like some of the indentation doesn't line up? Bikeshedding aside, it looks good to me.

I agree with Eric Smith that the first part of the macro name usually refers to the type of the first argument (or the type the first argument points to). Examples:

- Py_UNICODE_ISSPACE : Py_UNICODE -> int
- Py_UNICODE_TOLOWER : Py_UNICODE -> Py_UNICODE
- Py_UNICODE_strlen : Py_UNICODE * -> size_t

This is true elsewhere in the code as well:

- PyList_GET_SIZE : PyListObject * -> Py_ssize_t

Yes, I know there are some unfortunate exceptions. ;-)

I agree that it would be nice if something in the name hinted that the return type was Py_UCS4 though.

Marc-Andre Lemburg wrote:
> The first argument of the macro can be any array, not just Py_UNICODE*, but also Py_UCS4* or even int*.

It's true that macros in C do not have any type safety. While technically passing a Py_UCS4 * will work, on a UCS2 build it would needlessly check the Py_UCS4 data for surrogates. I think we should discourage that. You can also technically pass a PyListObject * to PyTuple_GET_SIZE, but that's also not a good idea. ;-)

Alexander Belopolsky wrote:
> The issue is that, in the process of reading the code point, it is already determined whether the code point is BMP or non-BMP. Testing the result again in order to write it is somewhat wasteful. I don't think this would matter in practice, but would like to hear alternative opinions before moving further.

If the common pattern is:

    ch = Py_UNICODE_NEXT(rp, end);
    uc = Py_UNICODE_SOME_TRANSFORMATION(ch);
    Py_UNICODE_PUT_NEXT(wp, uc);

The second check for surrogates in Py_UNICODE_PUT_NEXT is necessary, unless you can prove that Py_UNICODE_SOME_TRANSFORMATION will never transform characters < 0x10000 into characters >= 0x10000 or vice versa.

Can we prove that will always be the case, for current and future versions of Unicode, for all or almost all of the transformations we care about?

Answering that question and figuring out what to do about it are probably more trouble than it's worth. If a particular point proves to be a bottleneck, we can always specialize the code there later.
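The read/transform/write pattern above can be sketched as a complete loop. Everything here is a hypothetical stand-in: unit16 plays the role of a narrow-build Py_UNICODE, read_next and put_next approximate what the proposed macros are described as doing, and embolden is an invented transformation chosen only because it maps a BMP character to a supplementary one, which is the case where the second surrogate check matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit16;   /* stand-in for a narrow-build Py_UNICODE */

/* Read one code point, joining a valid surrogate pair if present. */
static uint32_t read_next(const unit16 **pp, const unit16 *end)
{
    uint32_t ch;

    assert(*pp < end);
    ch = *(*pp)++;
    if (ch >= 0xD800 && ch <= 0xDBFF && *pp < end &&
        **pp >= 0xDC00 && **pp <= 0xDFFF) {
        ch = (((ch & 0x3FF) << 10) | (uint32_t)(**pp & 0x3FF)) + 0x10000;
        (*pp)++;
    }
    return ch;
}

/* Write one code point as one or two units. */
static void put_next(unit16 **pp, uint32_t ch)
{
    if (ch >= 0x10000) {
        ch -= 0x10000;
        *(*pp)++ = (unit16)(0xD800 | (ch >> 10));
        *(*pp)++ = (unit16)(0xDC00 | (ch & 0x3FF));
    }
    else {
        *(*pp)++ = (unit16)ch;
    }
}

/* Invented transformation: 'A' becomes U+1D400, everything else is
   unchanged.  It deliberately crosses the BMP boundary. */
static uint32_t embolden(uint32_t ch)
{
    return ch == 'A' ? 0x1D400 : ch;
}

/* Transform in[0..n) into out; out needs room for 2 * n units because
   each input unit may turn into a surrogate pair.  Returns the number
   of units written. */
static size_t transform(const unit16 *in, size_t n, unit16 *out)
{
    const unit16 *rp = in, *end = in + n;
    unit16 *wp = out;

    while (rp < end)
        put_next(&wp, embolden(read_next(&rp, end)));
    return (size_t)(wp - out);
}
```

Here the output is longer than the input, so a writer that assumed "BMP in, BMP out" would have stored a truncated value instead of a surrogate pair.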
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Daniel, While these macros should not affect ABI, I would appreciate your feedback in light of your work on issue 8654. -- nosy: +stutzbach
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Daniel Stutzbach stutzb...@google.com added the comment: +1 on the general idea of abstracting out repeated code. I will take a closer look at the details within the next few days.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger rep...@bugs.python.org wrote:
..
> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.

As a data point, ICU defines U16_NEXT() for a similar purpose. I also like the ICU terminology for surrogates ("lead" and "trail") better than the backward "high" and "low". U16_APPEND() suggests Py_UNICODE_APPEND instead of PUT_NEXT (this one has the virtue of not having "next" in the name as well). I still like NEXT better than ADVANCE because it is shorter and has an obvious PREV counterpart that we may want to add later. Note that ICU uses the U16_ prefix for these macros even when they operate on 32-bit characters.

More at
http://icu-project.org/apiref/icu4c/utf16_8h.html
http://userguide.icu-project.org/strings
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> On Sat, Nov 27, 2010 at 6:38 PM, Raymond Hettinger rep...@bugs.python.org wrote:
> ..
>> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.
>
> As a data point, ICU defines U16_NEXT() for a similar purpose. I also like the ICU terminology for surrogates ("lead" and "trail") better than the backward "high" and "low".

High and low are Unicode standard terms, so we should use those.

Regarding Py_UCS4_READ_CODE_POINT: you're right that surrogates are code points, so how about Py_UCS4_READ_NEXT() ?!

Regarding Py_UCS4_READ_NEXT() vs. Py_UNICODE_READ_NEXT(): the return value of the macro is a Py_UCS4 value, not a Py_UNICODE value. The first argument of the macro can be any array, not just Py_UNICODE*, but also Py_UCS4* or even int*. Py_UCS2_READ_NEXT() would be plain wrong :-)

Also note that Python does have a Py_UCS4 type; it doesn't have a Py_UCS2 type. That's why we should use *Py_UCS4*_READ_NEXT().
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Raymond Hettinger wrote:
> Raymond Hettinger rhettin...@users.sourceforge.net added the comment:
>
> Mark, can you opine on this?

Yes, I'll have a look later today.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

I like the idea and thanks for putting work into this. Some comments:

* when using macro variables, always put the variables in parens in the expansion; this avoids precedence issues, weird syntax errors, etc. - even if it may not be necessary

* a function would be cleaner, but since this code is very performance sensitive, I'd opt for the macro version, unless someone can prove that a function would be just as fast in benchmarks

* the macros should be documented in the unicodeobject.h header file and clearly mention that ptr and end should be side-effect free and that ptr must be an lvalue

* please use the faster bitmask operators for joining surrogates, i.e.

    ucs4 = ((((high & 0x03FF) << 10) | (low & 0x03FF)) + 0x00010000);

* the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()

* same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

* in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

* this version should be slightly faster and is also easier to read:

    #define Py_UCS4_READ_CODE_POINT(ptr, end) \
        ((Py_UNICODE_ISHIGHSURROGATE((ptr)[0]) && \
          (ptr) < (end) && \
          Py_UNICODE_ISLOWSURROGATE((ptr)[1])) ? \
         Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
         (Py_UCS4)*(ptr)++)

I haven't tested it, but you get the idea.

BTW: You only focus on UCS2 builds. Please also make sure that these changes work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 builds and join code points there.

Note that UCS4 builds currently don't join surrogates, so a high and a low surrogate appear as two code points, which they are, but given the experience with UCS2 builds, may not be what the user expects. So for the purpose of consistency we should be careful with auto-joining surrogates in UCS2. It does make sense for ord() and the various string methods, but should be done with care in other cases.

In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
..
[I'll respond to the skipped points when I update the patch]

> In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.

It may be best to start with _Py_UCS2_READ_CODE_POINT() (BTW, I like the name because it naturally leads to a Py_UCS2_WRITE_CODE_POINT() counterpart). The leading underscore will probably not stop early adopters from using it, and we may get some user feedback if they ask to make these macros public.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
..
> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>
> * in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

I am not sure the Py_UCS4_ prefix is right here. (I agree on the *SURROGATE* macros.) The point of Py_UNICODE_NEXT(ptr, end) is that the pointers ptr and end are Py_UNICODE* and the macro expands to *p++ on wide builds. Maybe Py_UNICODE_NEXT_UCS4?
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Marc-Andre Lemburg m...@egenix.com added the comment:

Alexander Belopolsky wrote:
> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>
> On Sat, Nov 27, 2010 at 5:03 PM, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> ..
>> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>>
>> * in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)
>
> I am not sure the Py_UCS4_ prefix is right here. (I agree on the *SURROGATE* macros.) The point of Py_UNICODE_NEXT(ptr, end) is that the pointers ptr and end are Py_UNICODE* and the macro expands to *p++ on wide builds. Maybe Py_UNICODE_NEXT_UCS4?

The idea is that the first part refers to what the macro returns (Py_UCS4) and the "read" part of the name refers to moving a pointer across an array (any array of integers).

Note that the macro can also work on Py_UCS4 arrays (even in UCS2 builds), so it's universal in that respect.

Perhaps we should allow ord() to work on surrogates in UCS4 builds as well. That would reduce the number of surprises.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric Smith e...@trueblade.com added the comment:

> The idea is that the first part refers to what the macro returns (Py_UCS4) and the "read" part of the name refers to moving a pointer across an array (any array of integers).

I thought the first part generally meant the type of the first parameter. Although I can go either way, especially if we add an underscore.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

> * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
>
> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()

I'm not so familiar with the prefix conventions, but wouldn't that lead users to think that this macro is for wide builds and that they have to use Py_UCS2_* macros for narrow builds? If these macros are supposed to abstract the build type, maybe they should have a neutral prefix. (But if the conventions we use say otherwise, I guess the best we can do is to document it properly.)

> * in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)

The term "code point" is not entirely correct here. High and low surrogates are code points too. The right term should be "scalar value" (but that might be confusing). The "READ" bit sounds fine though, maybe "READ_NEXT"?
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, Nov 27, 2010 at 5:41 PM, Ezio Melotti rep...@bugs.python.org wrote:
> Ezio Melotti ezio.melo...@gmail.com added the comment:
>
>> * the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
>>
>> * same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
>
> I'm not so familiar with the prefix conventions, but wouldn't that lead users to think that this macro is for wide builds and that they have to use Py_UCS2_* macros for narrow builds? If these macros are supposed to abstract the build type, maybe they should have a neutral prefix. (But if the conventions we use say otherwise, I guess the best we can do is to document it properly.)

When I was using the name, I did not think about the argument type. Py_UNICODE_ is just the namespace prefix used by all macros in unicodeobject.h. Case in point: Py_UNICODE_ISALPHA() and family, which take Py_UCS4. (I know, there is a historical reason at work here, but why fight it?) Functions use the PyUnicode_ prefix and build-specific functions use the PyUnicodeUCSx_ prefix. As far as I can tell, there are no macros with a Py_UCS4_ prefix.

The choices I like, in order of preference, are

1. Py_UNICODE_NEXT
2. Py_UNICODE_NEXT_UCS4
3. Py_UNICODE_READ_NEXT_UCS4

I can live with anything else, though.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Raymond Hettinger rhettin...@users.sourceforge.net added the comment: I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Antoine Pitrou pit...@free.fr added the comment:

> I suggest Py_UNICODE_ADVANCE() to avoid false suggestion that the iterator protocol is being used.

You can't use the iterator protocol on a non-PyObject, and Py_UNICODE_* (as opposed to PyUnicode_*) suggests the macro operates on a raw array of code points.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Ezio Melotti ezio.melo...@gmail.com added the comment:

AFAIU the macro returns lone surrogates as they are. This means that:

1) if the string contains only surrogate pairs, Py_UNICODE_NEXT will iterate on scalar values[0];
2) if the string contains only lone surrogates, it will iterate on codepoints[1];
3) if it contains both, it will be half and half (i.e. scalar values if the surrogates are in pairs, or falling back on codepoints if they aren't);

(for strings without surrogates, iterating on scalar values or codepoints is the same).

Are these semantics correct for all (or at least most of) the places where the macro will be used? Would a stricter version (that rejects lone surrogates and iterates on scalar values only) be useful in addition to, or as an alternative to, Py_UNICODE_NEXT?

[0]: http://unicode.org/glossary/#unicode_scalar_value
[1]: http://unicode.org/glossary/#code_point
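One possible shape for the stricter variant asked about above, as a rough sketch only. The name next_scalar_value, the unit16 typedef and the 0 / -1 return convention are all assumptions for illustration; nothing like this is part of the attached patches.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t unit16;   /* stand-in for a narrow-build Py_UNICODE */

/* Hypothetical strict reader: stores the next Unicode scalar value in
   *out and advances *pp, returning 0.  On a lone surrogate it returns
   -1 and leaves *pp and *out untouched. */
static int next_scalar_value(const unit16 **pp, const unit16 *end,
                             uint32_t *out)
{
    const unit16 *p = *pp;
    uint32_t ch;

    assert(p < end);
    ch = *p++;
    if (ch >= 0xD800 && ch <= 0xDBFF) {
        if (p == end || *p < 0xDC00 || *p > 0xDFFF)
            return -1;              /* lone high surrogate */
        ch = (((ch & 0x3FF) << 10) | (uint32_t)(*p & 0x3FF)) + 0x10000;
        p++;
    }
    else if (ch >= 0xDC00 && ch <= 0xDFFF) {
        return -1;                  /* lone low surrogate */
    }
    *out = ch;
    *pp = p;
    return 0;
}
```

A caller that needs the lenient behaviour described in points 1) to 3) could fall back to copying the single unit when -1 is returned.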
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: I am attaching a patch that defines Py_UNICODE_PUT_NEXT() macro (tentative name) and uses it to fix str.upper method. The implementation of surrogate-aware str.upper shows that NEXT/PUT_NEXT abstractions may lead to somewhat inefficient code for by-codepoint processing. The issue is that in the process of reading the code point, it is already determined whether the code point is BMP or non-BMP. Testing the result again in order to write it is somewhat wasteful. I don't think this would matter in practice, but would like to hear alternative opinions before moving further. (Please, don't argue over names - let's figure out the proper semantics first.) -- Added file: http://bugs.python.org/file19845/issue10542-put-next.diff
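For readers without the patch at hand, the PUT_NEXT idea can be sketched as a plain function. This is only an illustration under assumptions: the attached patch defines a macro, while sketch_unicode_put_next and the unit_t/ucs4_t typedefs below are hypothetical names standing in for a narrow-build Py_UNICODE and Py_UCS4.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a narrow-build Py_UNICODE and for Py_UCS4. */
typedef uint16_t unit_t;
typedef uint32_t ucs4_t;

/* Store ch at *pp as one unit (BMP) or as a surrogate pair (non-BMP)
   and advance *pp accordingly. The caller must guarantee room for
   two units. */
static void
sketch_unicode_put_next(unit_t **pp, ucs4_t ch)
{
    unit_t *p = *pp;

    if (ch > 0xFFFF) {
        /* This is the second BMP/non-BMP test mentioned above:
           the reader already knew the answer when it decoded ch. */
        ch -= 0x10000;
        *p++ = (unit_t)(0xD800 | (ch >> 10));
        *p++ = (unit_t)(0xDC00 | (ch & 0x3FF));
    }
    else {
        *p++ = (unit_t)ch;
    }
    *pp = p;
}
```

The branch on ch > 0xFFFF is the repeated check the comment calls somewhat wasteful; a case-mapping loop that reads and writes in the same pass could in principle reuse the reader's decision, at the cost of a less tidy abstraction.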
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
New submission from Alexander Belopolsky belopol...@users.sourceforge.net:

As discussed in issue 10521 and the sprawling "len(chr(i)) = 2?" thread [1] on python-dev, many functions in python library behave differently on narrow and wide builds. While there are unavoidable differences such as the length of strings with non-BMP characters, many functions can work around these differences. For example, the ord() function already produces integers over 0xFFFF when given a surrogate pair as a string of length two on a narrow build. Other functions such as str.isalpha(), are not yet aware of surrogates. See also issue9200.

A consensus is developing that non-BMP characters support on narrow builds is here to stay and that naive functions should be fixed. Unfortunately, working with surrogates in python code is tricky because unicode C-API does not provide much support and existing examples of surrogate processing look like this:

-while (u != uend && w != wend) {
-    if (0xD800 <= u[0] && u[0] <= 0xDBFF
-        && 0xDC00 <= u[1] && u[1] <= 0xDFFF)
-    {
-        *w = (((u[0] & 0x3FF) << 10) | (u[1] & 0x3FF)) + 0x10000;
-        u += 2;
-    }
-    else {
-        *w = *u;
-        u++;
-    }
-    w++;
-}

The attached patch introduces a Py_UNICODE_NEXT() macro that allows replacing the code above with two lines:

+while (u != uend && w != wend)
+    *w++ = Py_UNICODE_NEXT(u, uend);

The patch also introduces a set of macros for manipulating the surrogates, but I have not started replacing more instances of verbose surrogate processing because I would like to first look for higher level abstractions such as Py_UNICODE_NEXT(). For example, there are many instances that can benefit from Py_UNICODE_PUT_NEXT(ptr, ch) macro that would put a UCS4 character ch into Py_UNICODE buffer pointed by ptr and advance ptr by 1 or 2 units as necessary.

[1] http://mail.python.org/pipermail/python-dev/2010-November/105908.html

--
assignee: belopolsky
components: Extension Modules, Interpreter Core, Unicode
files: unicode-next.diff
keywords: patch
messages: 122464
nosy: Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti, lemburg, pitrou
priority: normal
severity: normal
stage: patch review
status: open
title: Py_UNICODE_NEXT and other macros for surrogates
type: feature request
versions: Python 3.2
Added file: http://bugs.python.org/file19825/unicode-next.diff
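To make the NEXT semantics described in the submission concrete, here is a minimal sketch written as a function rather than a macro. It is not the code from unicode-next.diff: sketch_unicode_next and the unit_t/ucs4_t typedefs are hypothetical, and the sketch assumes a narrow build where one unit is 16 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a narrow-build Py_UNICODE and for Py_UCS4. */
typedef uint16_t unit_t;
typedef uint32_t ucs4_t;

/* Return the code point starting at *pp and advance *pp by 1 or 2 units.
   A high surrogate followed by a low surrogate is joined; anything else,
   including a lone surrogate, is returned unchanged.
   The caller must guarantee *pp < end. */
static ucs4_t
sketch_unicode_next(const unit_t **pp, const unit_t *end)
{
    const unit_t *p = *pp;
    ucs4_t ch = *p++;

    if (0xD800 <= ch && ch <= 0xDBFF
        && p < end
        && 0xDC00 <= *p && *p <= 0xDFFF)
    {
        ch = (((ch & 0x3FF) << 10) | (*p++ & 0x3FF)) + 0x10000;
    }
    *pp = p;
    return ch;
}
```

A copy loop in the style of the two-line replacement would then read `while (u != uend && w != wend) *w++ = sketch_unicode_next(&u, uend);`. Unlike the original loop, the sketch checks p < end before looking at the second unit, so it never reads past the buffer when the last unit is a high surrogate.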
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: -- nosy: +haypo, loewis
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric Smith e...@trueblade.com added the comment: In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUT_NEXT, str.__format__ would also need a function that tells it how many Py_UNICODEs are needed to store a given Py_UCS4.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Nov 26, 2010 at 7:27 PM, Eric Smith rep...@bugs.python.org wrote:
> ..
> In addition to the proposed Py_UNICODE_NEXT and Py_UNICODE_PUT_NEXT, str.__format__ would also need a function that tells it how many Py_UNICODEs are needed to store a given Py_UCS4.

Yes, this functionality is currently hidden in unicode_aswidechar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size):

/* Helper function for PyUnicode_AsWideChar() and PyUnicode_AsWideCharString():
   convert a Unicode object to a wide character string.

   - If w is NULL: return the number of wide characters (including the nul
     character) required to convert the unicode object. Ignore size argument.
   .. */

and I believe is reimplemented in a few other places.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric Smith e...@trueblade.com added the comment: I'd need access to this without having to build a PyUnicodeObject, for efficiency. But it sounds like it does have the basic functionality I need. For my use I'd really need it to take the result of Py_UNICODE_NEXT. Something like: Py_ssize_t Py_UNICODE_NUM_NEEDED(Py_UCS4 c) and it would always return 1 or 2. Always 1 for a wide build, and for a narrow build 1 if c is in the BMP else 2. Choose a better name, of course.
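The primitive described above is small enough to sketch in full. The names are assumptions made for the example: sketch_units_needed stands in for the tentative Py_UNICODE_NUM_NEEDED, ptrdiff_t and uint32_t stand in for Py_ssize_t and Py_UCS4, and SKETCH_WIDE_BUILD is a hypothetical switch, not a real CPython configuration macro.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of storage units needed for one code point:
   always 1 on a wide build; on a narrow build, 1 for a BMP
   character and 2 for a non-BMP character. */
static ptrdiff_t
sketch_units_needed(uint32_t ch)
{
#ifdef SKETCH_WIDE_BUILD
    (void)ch;               /* every code point fits in one unit */
    return 1;
#else
    return ch > 0xFFFF ? 2 : 1;
#endif
}
```

Because it takes the code point itself rather than a pair of pointers, it can be applied directly to a value already returned by a NEXT-style reader, which is the property asked for here.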
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Nov 26, 2010 at 7:45 PM, Eric Smith rep...@bugs.python.org wrote:
> ..
> For my use I'd really need it to take the result of Py_UNICODE_NEXT. Something like: Py_ssize_t Py_UNICODE_NUM_NEEDED(Py_UCS4 c) and it would always return 1 or 2. Always 1 for a wide build, and for a narrow build 1 if c is in the BMP else 2. Choose a better name, of course.

Can you describe your use case in more detail? Would Py_UNICODE_PUT_NEXT() combined with Py_UNICODE_CODEPOINT_COUNT(Py_UNICODE *begin, Py_UNICODE *end) solve it?
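The thread does not spell out what Py_UNICODE_CODEPOINT_COUNT would do, so the following is only one plausible reading: count the code points in a narrow-build buffer, treating each well-formed surrogate pair as one and each lone surrogate as one. The function name and the use of uint16_t for the unit type are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Count code points in [begin, end) of a 16-bit unit buffer.
   A high surrogate immediately followed by a low surrogate counts
   once; every other unit, including a lone surrogate, counts once. */
static ptrdiff_t
sketch_codepoint_count(const uint16_t *begin, const uint16_t *end)
{
    ptrdiff_t n = 0;

    while (begin < end) {
        if (0xD800 <= begin[0] && begin[0] <= 0xDBFF
            && end - begin >= 2
            && 0xDC00 <= begin[1] && begin[1] <= 0xDFFF)
        {
            begin += 2;     /* surrogate pair: one code point */
        }
        else {
            begin++;        /* BMP character or lone surrogate */
        }
        n++;
    }
    return n;
}
```

This answers how many code points a buffer holds, the inverse of asking how many units one code point needs, so a caller that holds only the fill character would still have to place it in a buffer before calling it.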
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
STINNER Victor victor.stin...@haypocalc.com added the comment: I don't like macro having a result and using multiple instructions using the evil magic trick (the ,). It's harder to maintain the code and harder to debug than a classical function. Don't you think that modern compilers are able to inline the code? (If not, we may add the right C attribute/keyword)
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric Smith e...@trueblade.com added the comment: The code will basically be:

    Py_UCS4 fill;
    parse_format_string(fmt, ..., &fill, ...);
    /* lots more code */
    if (fill_needed) {
        /* compute how many characters to reserve */
        space_needed = Py_UNICODE_NUM_NEEDED(fill) * number_of_characters_to_fill;
    }

It would be most convenient (and require the fewest changes) if the computation could just use fill, instead of remembering the pointers to the beginning and end of fill. Py_UNICODE_CODEPOINT_COUNT could be implemented with a primitive that does what I want.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Nov 26, 2010 at 8:41 PM, STINNER Victor rep...@bugs.python.org wrote:
> ..
> I don't like macro having a result and using multiple instructions using the evil magic trick (the ,). It's harder to maintain the code and harder to debug than a classical function.

You are preaching to the choir. In fact, my first version (issue10521-unicode-next.diff attached to issue10521) used a function. I would not worry about implementation at this point, though. Let's find the best abstraction first.

> Don't you think that modern compilers are able to inline the code? (If not, we may add the right C attribute/keyword)

Not in C. In C++, I could use a reference to the pointer incremented by the macro, but in C, I have to use an address. Once you take an address of a variable, the compiler will refuse to put it in a register. So no, I don't think we can write an ANSI C function that will be as efficient as the macro.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Eric Smith e...@trueblade.com added the comment: The compiler's decision to inline something should not be related to its ability to put variables in a register. But I definitely agree that we should get the abstraction right first and worry about the implementation later.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: On Fri, Nov 26, 2010 at 9:22 PM, Eric Smith rep...@bugs.python.org wrote:
> ..
> But I definitely agree that we should get the abstraction right first and worry about the implementation later.

I am fairly happy with Py_UNICODE_NEXT() abstraction. Its semantics should be natural for users familiar with python iterators and the fact that it expands to simply *ptr++ on wide builds makes it easy to explain its usage. I am not very happy about the end argument for the following reasons:

1. Builtin next() takes the default value as a second argument. Extension writers may expect the same from Py_UNICODE_NEXT(). The name end should be self-explanatory though, especially to those with an exposure to STL.

2. If Py_UNICODE_NEXT() stays as a macro, an innocent looking Py_UNICODE_NEXT(p, p + size) will have a hard to detect bug. Can be fixed by making Py_UNICODE_NEXT() a function.

I wonder whether it is best to prefix the new macros with an underscore. On one hand, we want to make this available to extension writers, on the other hand, once more people start dealing with non-BMP issues, a better abstraction may be found and we may not want to maintain Py_UNICODE_NEXT() indefinitely.
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Alexander Belopolsky belopol...@users.sourceforge.net added the comment: Raymond, I wonder if you would like to comment on the iterator analogy and/or on adding public names to C API. -- nosy: +rhettinger
[issue10542] Py_UNICODE_NEXT and other macros for surrogates
Raymond Hettinger rhettin...@users.sourceforge.net added the comment: Mark, can you opine on this? -- assignee: belopolsky - lemburg