[issue9036] Simplify Py_CHARMASK

2010-07-20 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

Antoine, Eric, thanks for looking at the patch. unicodeobject.c patch
committed in r82978, r82979, r82980 and r82984.

As Antoine pointed out, the _json.c issue is not a problem. Also, it is
not present in py3k. The reliable buildbots look ok, so I think we can
close this.

--
resolution:  - fixed
stage: patch review - committed/rejected
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-07-19 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Antoine, I understood that you would prefer to leave the mask. Could I
 apply the second version of the patch?

It looks ok to me.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-07-19 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

Antoine, thanks! Committed in r82966, r82968, r82969 and r82970.


Could we fix the unicodeobject.c bug on the fly? I think the patch should
do, unless non-ascii digits are supported. But in that case several other
changes would be needed.

--
Added file: http://bugs.python.org/file18065/unicodeobject_py_charmask.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-07-19 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Eric, is the unicode patch ok for you?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-07-19 Thread Eric Smith

Eric Smith e...@trueblade.com added the comment:

Yes, the unicode patch looks okay to me.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-07-17 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

[Marc-Andre]
 Why not just make the casts in those cases explicit instead of
 using Py_CHARMASK ?

I agree that this would be the cleanest solution. It's harder to get
someone to review a larger patch though.


Antoine, I understood that you would prefer to leave the mask. Could I
apply the second version of the patch?

It will probably have an immediate benefit. On the ARM buildbot char
is unsigned and gcc emits a whole page of spurious 'array subscript has
type 'char' warnings.

--
Added file: http://bugs.python.org/file18044/py_charmask2.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-24 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

Antoine Pitrou rep...@bugs.python.org wrote:
  Thus,
  ((unsigned char)((c)  0xff)) and ((unsigned char)(c)) should produce
  the same results.
 
 If it's the case, it's probably optimized away by the compiler anyway.

Yes, the asm output is the same. I was more concerned about readability.

Historically, Py_CHARMASK started as ((c)  0xff), hence the name. In r34974
UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.

  There is no reason not to do the cast when __CHAR_UNSIGNED__ is
  defined (it will be a no-op).
 
 Agreed. It also reduces the opportunity for bugs :)

Ok, let's say we use ((unsigned char)((c)  0xff)) also for __CHAR_UNSIGNED__.

What should the comment say about the intended argument? I've examined all
occurrences in the tree and almost always Py_CHARMASK is used with either
char arguments or int arguments that are directly derived from a char, so
no EOF issues arise.

Exceptions:
===

Index: Objects/unicodeobject.c
===
--- Objects/unicodeobject.c (revision 82192)
+++ Objects/unicodeobject.c (working copy)
@@ -8417,6 +8417,7 @@
 else if (c = '0'  c = '9') {
 prec = c - '0';
 while (--fmtcnt = 0) {
+/* XXX: c and *fmt are Py_UNICODE */
 c = Py_CHARMASK(*fmt++);
 if (c  '0' || c  '9')
 break;

Index: Modules/_json.c
===
--- Modules/_json.c (revision 82192)
+++ Modules/_json.c (working copy)
@@ -603,6 +603,7 @@
 }
 }
 else {
+/* XXX: c is Py_UNICODE */
 char c_char = Py_CHARMASK(c);
 chunk = PyString_FromStringAndSize(c_char, 1);
 if (chunk == NULL) {

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-24 Thread Stefan Krah

Stefan Krah stefan-use...@bytereef.org added the comment:

 In r61936 the cast was added.

Actually r61987

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Stefan Krah wrote:
 
 Stefan Krah stefan-use...@bytereef.org added the comment:
 
 Antoine Pitrou rep...@bugs.python.org wrote:
 Thus,
 ((unsigned char)((c)  0xff)) and ((unsigned char)(c)) should produce
 the same results.

 If it's the case, it's probably optimized away by the compiler anyway.
 
 Yes, the asm output is the same. I was more concerned about readability.
 
 Historically, Py_CHARMASK started as ((c)  0xff), hence the name. In r34974
 UCHAR_MAX = 2**CHAR_BIT-1 = 255 was enforced. In r61936 the cast was added.
 
 There is no reason not to do the cast when __CHAR_UNSIGNED__ is
 defined (it will be a no-op).

 Agreed. It also reduces the opportunity for bugs :)
 
 Ok, let's say we use ((unsigned char)((c)  0xff)) also for __CHAR_UNSIGNED__.
 
 What should the comment say about the intended argument? I've examined all
 occurrences in the tree and almost always Py_CHARMASK is used with either
 char arguments or int arguments that are directly derived from a char, so
 no EOF issues arise.

Why not just make the casts in those cases explicit instead of
using Py_CHARMASK ?

 Exceptions:
 ===
 
 Index: Objects/unicodeobject.c
 ===
 --- Objects/unicodeobject.c (revision 82192)
 +++ Objects/unicodeobject.c (working copy)
 @@ -8417,6 +8417,7 @@
  else if (c = '0'  c = '9') {
  prec = c - '0';
  while (--fmtcnt = 0) {
 +/* XXX: c and *fmt are Py_UNICODE */
  c = Py_CHARMASK(*fmt++);

If both are Py_UNICODE, why do you need the cast at all ?

Note that the above use also appears to be wrong, since if *fmt
is Py_UNICODE, the following if() will also match if the higher
order bytes of the Py_UNICODE value are non-0.

  if (c  '0' || c  '9')
  break;
 
 Index: Modules/_json.c
 ===
 --- Modules/_json.c (revision 82192)
 +++ Modules/_json.c (working copy)
 @@ -603,6 +603,7 @@
  }
  }
  else {
 +/* XXX: c is Py_UNICODE */
  char c_char = Py_CHARMASK(c);

In this case you should get a compiler warning due to the
different signedness of the arguments.

Same comment as above: the high order bytes of c are lost
in this conversion.

Why can't _json.c use one of the existing Unicode conversion
APIs ?

  chunk = PyString_FromStringAndSize(c_char, 1);
  if (chunk == NULL) {
 
 --
 
 ___
 Python tracker rep...@bugs.python.org
 http://bugs.python.org/issue9036
 ___
 ___
 Python-bugs-list mailing list
 Unsubscribe: 
 http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com

--
nosy: +lemburg

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-24 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Ok, let's say we use ((unsigned char)((c)  0xff)) also for
 __CHAR_UNSIGNED__.
 
 What should the comment say about the intended argument?

That it's either in [-128; 127] or in [0; 255] ?

 Index: Objects/unicodeobject.c
 ===
 --- Objects/unicodeobject.c (revision 82192)
 +++ Objects/unicodeobject.c (working copy)
 @@ -8417,6 +8417,7 @@
  else if (c = '0'  c = '9') {
  prec = c - '0';
  while (--fmtcnt = 0) {
 +/* XXX: c and *fmt are Py_UNICODE */
  c = Py_CHARMASK(*fmt++);

This is a funny bug:

 u%.1\u1032f % (1./3)
u'0.'

 Index: Modules/_json.c
 ===
 --- Modules/_json.c (revision 82192)
 +++ Modules/_json.c (working copy)
 @@ -603,6 +603,7 @@
  }
  }
  else {
 +/* XXX: c is Py_UNICODE */
  char c_char = Py_CHARMASK(c);

This block can only be entered if c = 0x7f (`has_unicode` is false), so it's 
not a problem.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-20 Thread Stefan Krah

New submission from Stefan Krah stefan-use...@bytereef.org:

This feature request is extracted from issue 9020, where Py_CHARMASK(c)
was used on EOF, producing different results depending on whether 
__CHAR_UNSIGNED__ is defined or not.

The preferred fix for issue 9020 is to check for EOF before using
Py_CHARMASK, so this is only loosely related.


I've looked through the source tree to determine how Py_CHARMASK
is meant to be used. It seems that it is only used in situations
where one would actually want to cast a char to an unsigned char,
like isspace((unsigned char)c).


Simplifications:

1) Python.h already enforces that an unsigned char is 8 bit wide. Thus,
   ((unsigned char)((c)  0xff)) and ((unsigned char)(c)) should produce
   the same results.

2) There is no reason not to do the cast when __CHAR_UNSIGNED__ is
   defined (it will be a no-op).

--
components: Interpreter Core
files: py_charmask.patch
keywords: needs review, patch
messages: 108234
nosy: eric.smith, loewis, pitrou, skrah
priority: normal
severity: normal
stage: patch review
status: open
title: Simplify Py_CHARMASK
type: feature request
versions: Python 2.7, Python 3.1, Python 3.2
Added file: http://bugs.python.org/file17724/py_charmask.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9036] Simplify Py_CHARMASK

2010-06-20 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Thus,
 ((unsigned char)((c)  0xff)) and ((unsigned char)(c)) should produce
 the same results.

If it's the case, it's probably optimized away by the compiler anyway.

 There is no reason not to do the cast when __CHAR_UNSIGNED__ is
 defined (it will be a no-op).

Agreed. It also reduces the opportunity for bugs :)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9036
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com