[issue9036] Simplify Py_CHARMASK
Stefan Krah stefan-use...@bytereef.org added the comment: Antoine, Eric, thanks for looking at the patch. unicodeobject.c patch committed in r82978, r82979, r82980 and r82984. As Antoine pointed out, the _json.c issue is not a problem. Also, it is not present in py3k. The reliable buildbots look ok, so I think we can close this. -- resolution: - fixed stage: patch review - committed/rejected status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Antoine Pitrou pit...@free.fr added the comment: Antoine, I understood that you would prefer to leave the mask. Could I apply the second version of the patch? It looks ok to me. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Stefan Krah stefan-use...@bytereef.org added the comment: Antoine, thanks! Committed in r82966, r82968, r82969 and r82970. Could we fix the unicodeobject.c bug on the fly? I think the patch should do, unless non-ascii digits are supported. But in that case several other changes would be needed. -- Added file: http://bugs.python.org/file18065/unicodeobject_py_charmask.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Antoine Pitrou pit...@free.fr added the comment: Eric, is the unicode patch ok for you? -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Eric Smith e...@trueblade.com added the comment: Yes, the unicode patch looks okay to me. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Stefan Krah stefan-use...@bytereef.org added the comment: [Marc-Andre] Why not just make the casts in those cases explicit instead of using Py_CHARMASK ? I agree that this would be the cleanest solution. It's harder to get someone to review a larger patch though. Antoine, I understood that you would prefer to leave the mask. Could I apply the second version of the patch? It will probably have an immediate benefit. On the ARM buildbot char is unsigned and gcc emits a whole page of spurious 'array subscript has type 'char' warnings. -- Added file: http://bugs.python.org/file18044/py_charmask2.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Stefan Krah stefan-use...@bytereef.org added the comment: Antoine Pitrou rep...@bugs.python.org wrote: Thus, ((unsigned char)((c) 0xff)) and ((unsigned char)(c)) should produce the same results. If it's the case, it's probably optimized away by the compiler anyway. Yes, the asm output is the same. I was more concerned about readability. Historically, Py_CHARMASK started as ((c) 0xff), hence the name. In r34974 UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added. There is no reason not to do the cast when __CHAR_UNSIGNED__ is defined (it will be a no-op). Agreed. It also reduces the opportunity for bugs :) Ok, let's say we use ((unsigned char)((c) 0xff)) also for __CHAR_UNSIGNED__. What should the comment say about the intended argument? I've examined all occurrences in the tree and almost always Py_CHARMASK is used with either char arguments or int arguments that are directly derived from a char, so no EOF issues arise. Exceptions: === Index: Objects/unicodeobject.c === --- Objects/unicodeobject.c (revision 82192) +++ Objects/unicodeobject.c (working copy) @@ -8417,6 +8417,7 @@ else if (c = '0' c = '9') { prec = c - '0'; while (--fmtcnt = 0) { +/* XXX: c and *fmt are Py_UNICODE */ c = Py_CHARMASK(*fmt++); if (c '0' || c '9') break; Index: Modules/_json.c === --- Modules/_json.c (revision 82192) +++ Modules/_json.c (working copy) @@ -603,6 +603,7 @@ } } else { +/* XXX: c is Py_UNICODE */ char c_char = Py_CHARMASK(c); chunk = PyString_FromStringAndSize(c_char, 1); if (chunk == NULL) { -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Stefan Krah stefan-use...@bytereef.org added the comment: In r61936 the cast was added. Actually r61987 -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Marc-Andre Lemburg m...@egenix.com added the comment: Stefan Krah wrote: Stefan Krah stefan-use...@bytereef.org added the comment: Antoine Pitrou rep...@bugs.python.org wrote: Thus, ((unsigned char)((c) 0xff)) and ((unsigned char)(c)) should produce the same results. If it's the case, it's probably optimized away by the compiler anyway. Yes, the asm output is the same. I was more concerned about readability. Historically, Py_CHARMASK started as ((c) 0xff), hence the name. In r34974 UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added. There is no reason not to do the cast when __CHAR_UNSIGNED__ is defined (it will be a no-op). Agreed. It also reduces the opportunity for bugs :) Ok, let's say we use ((unsigned char)((c) 0xff)) also for __CHAR_UNSIGNED__. What should the comment say about the intended argument? I've examined all occurrences in the tree and almost always Py_CHARMASK is used with either char arguments or int arguments that are directly derived from a char, so no EOF issues arise. Why not just make the casts in those cases explicit instead of using Py_CHARMASK ? Exceptions: === Index: Objects/unicodeobject.c === --- Objects/unicodeobject.c (revision 82192) +++ Objects/unicodeobject.c (working copy) @@ -8417,6 +8417,7 @@ else if (c = '0' c = '9') { prec = c - '0'; while (--fmtcnt = 0) { +/* XXX: c and *fmt are Py_UNICODE */ c = Py_CHARMASK(*fmt++); If both are Py_UNICODE, why do you need the cast at all ? Note that the above use also appears to be wrong, since if *fmt is Py_UNICODE, the following if() will also match if the higher order bytes of the Py_UNICODE value are non-0. if (c '0' || c '9') break; Index: Modules/_json.c === --- Modules/_json.c (revision 82192) +++ Modules/_json.c (working copy) @@ -603,6 +603,7 @@ } } else { +/* XXX: c is Py_UNICODE */ char c_char = Py_CHARMASK(c); In this case you should get a compiler warning due to the different signedness of the arguments. Same comment as above: the high order bytes of c are lost in this conversion. Why can't _json.c use one of the existing Unicode conversion APIs ? chunk = PyString_FromStringAndSize(c_char, 1); if (chunk == NULL) { -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com -- nosy: +lemburg ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Antoine Pitrou pit...@free.fr added the comment: Ok, let's say we use ((unsigned char)((c) 0xff)) also for __CHAR_UNSIGNED__. What should the comment say about the intended argument? That it's either in [-128; 127] or in [0; 255] ? Index: Objects/unicodeobject.c === --- Objects/unicodeobject.c (revision 82192) +++ Objects/unicodeobject.c (working copy) @@ -8417,6 +8417,7 @@ else if (c = '0' c = '9') { prec = c - '0'; while (--fmtcnt = 0) { +/* XXX: c and *fmt are Py_UNICODE */ c = Py_CHARMASK(*fmt++); This is a funny bug: u%.1\u1032f % (1./3) u'0.' Index: Modules/_json.c === --- Modules/_json.c (revision 82192) +++ Modules/_json.c (working copy) @@ -603,6 +603,7 @@ } } else { +/* XXX: c is Py_UNICODE */ char c_char = Py_CHARMASK(c); This block can only be entered if c = 0x7f (`has_unicode` is false), so it's not a problem. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
New submission from Stefan Krah stefan-use...@bytereef.org: This feature request is extracted from issue 9020, where Py_CHARMASK(c) was used on EOF, producing different results depending on whether __CHAR_UNSIGNED__ is defined or not. The preferred fix for issue 9020 is to check for EOF before using Py_CHARMASK, so this is only loosely related. I've looked through the source tree to determine how Py_CHARMASK is meant to be used. It seems that it is only used in situations where one would actually want to cast a char to an unsigned char, like isspace((unsigned char)c). Simplifications: 1) Python.h already enforces that an unsigned char is 8 bit wide. Thus, ((unsigned char)((c) 0xff)) and ((unsigned char)(c)) should produce the same results. 2) There is no reason not to do the cast when __CHAR_UNSIGNED__ is defined (it will be a no-op). -- components: Interpreter Core files: py_charmask.patch keywords: needs review, patch messages: 108234 nosy: eric.smith, loewis, pitrou, skrah priority: normal severity: normal stage: patch review status: open title: Simplify Py_CHARMASK type: feature request versions: Python 2.7, Python 3.1, Python 3.2 Added file: http://bugs.python.org/file17724/py_charmask.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue9036] Simplify Py_CHARMASK
Antoine Pitrou pit...@free.fr added the comment: Thus, ((unsigned char)((c) 0xff)) and ((unsigned char)(c)) should produce the same results. If it's the case, it's probably optimized away by the compiler anyway. There is no reason not to do the cast when __CHAR_UNSIGNED__ is defined (it will be a no-op). Agreed. It also reduces the opportunity for bugs :) -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue9036 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com