[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2012-01-05 Thread Benjamin Peterson

Benjamin Peterson  added the comment:

I'm just going to close this and say "use 3.3".

--
nosy: +benjamin.peterson
resolution:  -> out of date
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2011-09-29 Thread Ezio Melotti

Ezio Melotti  added the comment:

It can still be fixed on 2.7/3.2 though.

--
versions: +Python 2.7




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2011-09-29 Thread STINNER Victor

STINNER Victor  added the comment:

This issue has been fixed in Python 3.3 thanks to PEP 393.

--
nosy: +haypo




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-27 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

After reading the additional messages here and on a similar issue Alexander 
opened after this, I see the point of wanting to make the difference between 
the two types of builds as transparent as sensibly possible. From that 
viewpoint, rejection of composed chars is not as bad because both types of 
builds act the same.

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-27 Thread Ezio Melotti

Ezio Melotti  added the comment:

I agree that s.center(n, char).encode('utf-8') should be the same on both 
builds -- even if their len() will be different -- for the following reasons:

1) the string will eventually be encoded, and if the result is the same on 
both builds, it will look the same too;
2) trying to keep the same len() will generate different results, and it won't 
work for an odd width like 'foo'.center(5, surrogate_pair) because you 
can't insert half a surrogate pair.
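
As a rough illustration of both points (a sketch, assuming Python 3.3 or later; the variable names are made up for this example), the same padded text can be written as 5 code points or as the 7 UTF-16 code units a narrow build would hold, and both encode to identical UTF-8:

```python
import struct

fill = '\U00010140'                       # non-BMP fill character
wide = 'foo'.center(5, fill)              # fill + 'foo' + fill on Python 3.3+

# The same text as UTF-16 code units: U+10140 is the pair D800 DD40.
units = [0xD800, 0xDD40, 0x66, 0x6F, 0x6F, 0xD800, 0xDD40]
raw = struct.pack('<%dH' % len(units), *units)
narrow = raw.decode('utf-16-le')          # surrogate pairs become code points

print(len(wide), len(units))              # code points vs. code units
print(wide.encode('utf-8') == narrow.encode('utf-8'))
```

A width of 5 counted in code units would leave two units of padding, that is, a single surrogate pair that cannot be split between the left and right sides.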

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-26 Thread Eric Smith

Eric Smith  added the comment:

I think these macros would be a reasonable approach. I think str.center, etc. 
should support non-BMP chars, because to not do so can raise an exception. 
Supporting composed graphemes seems like another problem altogether. And while 
we could fix that, it's clearly a larger step.

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-26 Thread Alexander Belopolsky

Alexander Belopolsky  added the comment:

On Fri, Nov 26, 2010 at 6:37 PM, Terry J. Reedy  wrote:
>
> Terry J. Reedy  added the comment:
>
> As a practical matter, I think that for at least the next decade, people are
> at least as likely to want to fill with a composed, multi-BMP-codepoint
> 'char' (grapheme) as with a non-BMP char. So to me, failure with the latter
> is no worse than failure with the former.
>

I disagree. '\N{AEGEAN WORD SEPARATOR DOT}'  ('𐄁') looks like a
reasonably shaped fill character, while say 'Z\N{COMBINING ACUTE
ACCENT}\N{COMBINING GRAVE ACCENT}' ('Ź̀') does not.  Yet this is not
the point of this bug report.  The point is that a Python user should
not have to care (much) about how many bytes per character Python uses
under the hood, or about the numeric value of a character that she can
enter in her program.

> The underlying problem is that centering k chars within n spaces with fill i
> is based on one-char per code encodings *and* fixed pitch fonts with
> one-char per space.

No. ' Section Title '.center(40, '*') will look good regardless of
font width and even more so when combined with  tag or its
equivalent in a given application.

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-26 Thread Terry J. Reedy

Terry J. Reedy  added the comment:

As a practical matter, I think that for at least the next decade, people are at 
least as likely to want to fill with a composed, multi-BMP-codepoint 'char' 
(grapheme) as with a non-BMP char. So to me, failure with the latter is no 
worse than failure with the former.

The underlying problem is that centering k chars within n spaces with fill i is 
based on one-char per code encodings *and* fixed pitch fonts with one-char per 
space. That model is not universally applicable, so I do not consider it a bug 
that functions based on that model are also not universally applicable. Perhaps 
docs should be clearer about the limitations of many of the string methods in 
the new context.

A full general solution to the general problem of centering requires a shift to 
physical units (points or mm) and detailed font information, including kerning. 
This is beyond the scope of a string method.
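
To show where the code-point model stops matching what a reader sees, here is a minimal sketch that treats combining marks as zero-width. The helpers `base_char_count` and `center_by_base_chars` are hypothetical names for this example only; the sketch ignores fonts, kerning and double-width characters, so it is nowhere near the full solution described above.

```python
import unicodedata

def base_char_count(s):
    """Count code points that are not combining marks (a rough column count)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

def center_by_base_chars(s, width, fill='*'):
    """Center s in `width` columns, treating combining marks as zero-width."""
    margin = max(width - base_char_count(s), 0)
    left = margin // 2
    return fill * left + s + fill * (margin - left)

grapheme = 'Z\N{COMBINING ACUTE ACCENT}\N{COMBINING GRAVE ACCENT}'
plain = grapheme.center(5, '*')               # pads to 5 code points
adjusted = center_by_base_chars(grapheme, 5)  # pads to 5 base characters

print(len(plain), base_char_count(plain))
print(len(adjusted), base_char_count(adjusted))
```

str.center() counts the two accents as characters, so its result occupies fewer visible columns than requested; the adjusted version reaches the requested number of columns at the cost of a longer string.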

So I consider this a feature request for a partial generalization of unclear 
utility and unclear definition.

--
nosy: +terry.reedy




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Amaury Forgeot d'Arc

Amaury Forgeot d'Arc  added the comment:

issue9200 already proposes a similar change to str.is* methods.

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +amaury.forgeotdarc




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

Alexander Belopolsky  added the comment:

Here is another proof of concept patch for the isalpha issue that introduces a 
higher level abstraction macro - Py_UNICODE_NEXT.  It should be possible to 
reuse this macro in all isxyz methods and other places where surrogates are 
currently processed.  It should be possible to come up with a pure macro 
definition of Py_UNICODE_NEXT.
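
The patch itself is not reproduced in this thread. As a Python model of the general idea only (not the macro from the patch; `next_code_point` and `code_points` are hypothetical names), a "next" operation over UTF-16 code units might look like this:

```python
def next_code_point(units, i):
    """Return (code_point, next_index) for the UTF-16 code units in `units`.

    A high surrogate followed by a low surrogate is combined into one
    code point; anything else, including a lone surrogate, is returned as-is.
    """
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF and i + 1 < len(units):
        low = units[i + 1]
        if 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), i + 2
    return unit, i + 1

def code_points(units):
    """Iterate over the code points in a sequence of UTF-16 code units."""
    i = 0
    while i < len(units):
        cp, i = next_code_point(units, i)
        yield cp

# 'a', U+10300 (OLD ITALIC LETTER A) as a surrogate pair, 'b'
units = [0x61, 0xD800, 0xDF00, 0x62]
decoded = list(code_points(units))
print([hex(cp) for cp in decoded])
```

A method such as isalpha could then test each yielded code point instead of each code unit.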

--
Added file: http://bugs.python.org/file19810/issue10521-unicode-next.diff




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Ezio Melotti

Ezio Melotti  added the comment:

I think that methods like str.isalpha can and should be fixed. Since 
_PyUnicode_IsAlpha now accepts a Py_UCS4, the body of unicode_isalpha can be 
changed to convert normal chars and surrogate pairs to a Py_UCS4 before 
calling Py_UNICODE_ISALPHA.
The attached patch is a proof of concept of this approach and returns True for 
'\N{OLD ITALIC LETTER A}'.isalpha() on a narrow build.
It still has a number of issues that should be addressed (check for narrow 
builds, check for lone surrogates, check for high surrogate at the end of a 
string, fix compiler warnings ...) but it should be good enough as a PoC.

I would also suggest introducing a set of macros to handle surrogates (e.g. 
detect, combine) and use it in all the functions that need to work with them.
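
For readers unfamiliar with the arithmetic such macros would wrap, here is a sketch in Python rather than C. The function names are hypothetical and are not taken from the patch or from CPython; only the UTF-16 surrogate ranges and offsets are standard.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def join_surrogates(high, low):
    """Combine a valid surrogate pair into one code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError('not a surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_surrogates(code_point):
    """Split a non-BMP code point into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('not a non-BMP code point')
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = split_surrogates(0x10300)       # OLD ITALIC LETTER A
print(hex(high), hex(low), hex(join_surrogates(high, low)))
```

Raising on invalid input is a choice made for this sketch; the checks listed above (lone surrogates, a high surrogate at the end of a string) would need an explicit policy in the real code.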

--
keywords: +patch
Added file: http://bugs.python.org/file19809/issue10521-isalpha.diff




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

Alexander Belopolsky  added the comment:

Here is another str method not ready for non-BMP chars:


>>> u = '\U00010140'
>>> u.translate({ord(u):ord('A')})
'𐅀'

(expected 'A')

>>> u = 'B'
>>> u.translate({ord(u):ord('A')})
'A'
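
For comparison, a minimal check of the expected behaviour, assuming Python 3.3 or later, where the non-BMP character is a single element of the string and is looked up in the table under its full code point:

```python
u = '\U00010140'
table = {ord(u): ord('A')}                  # key is 0x10140, one code point
translated = u.translate(table)
also_mixed = ('x' + u + 'y').translate(table)
print(translated, also_mixed)
```

On a narrow build the lookup presumably happened per code unit, so neither surrogate matched the key.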

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

Changes by Alexander Belopolsky :


--
Removed message: http://bugs.python.org/msg122313




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

Alexander Belopolsky  added the comment:

On Wed, Nov 24, 2010 at 3:37 PM, Marc-Andre Lemburg
 wrote:
..
> I don't think we should change that for the formatting methods.

That's a reasonable position.  What about

>>> unicodedata.category('\N{OLD ITALIC LETTER A}')
'Lo'
>>> '\N{OLD ITALIC LETTER A}'.isalpha()
False

The str.isalpha() method is underspecified in the reference manual,
but a comment in unicodectype.c describes Py_UNICODE_ISALPHA as
follows:

/* Returns 1 for Unicode characters having the category 'Ll', 'Lu', 'Lt',
   'Lo' or 'Lm', 0 otherwise. */

I don't have a wide build handy, but I am fairly sure  '\N{OLD ITALIC
LETTER A}'.isalpha() would produce True there.  The result above is
simply a consequence of surrogates being considered non-letters:

>>> [c.isalpha() for c in '\N{OLD ITALIC LETTER A}']
[False, False]
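
The same observation can be reproduced without a narrow build. This is a sketch assuming Python 3.3 or later: the letter is tested as a whole, then the two surrogate code points that a narrow build would have stored are built by hand and tested individually.

```python
import unicodedata

letter = '\N{OLD ITALIC LETTER A}'          # U+10300
category = unicodedata.category(letter)     # general category of the letter
whole = letter.isalpha()                    # tested as one code point

# The two UTF-16 code units a narrow build would have stored for it.
offset = ord(letter) - 0x10000
surrogates = [chr(0xD800 + (offset >> 10)), chr(0xDC00 + (offset & 0x3FF))]
halves = [(unicodedata.category(c), c.isalpha()) for c in surrogates]

print(category, whole, halves)
```

Surrogates have the general category 'Cs', which is not one of the letter categories listed in the comment above.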

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

Alexander Belopolsky wrote:
> 
> New submission from Alexander Belopolsky :
> 
> >>> 'xyz'.center(20, '\U00100140')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: The fill character must be exactly one character long
> 
> str.ljust and str.rjust are similarly affected.

I don't think we should change that for the formatting methods.

See my reply on python-dev:

str.center(n) centers the string in a padded string that
is composed of n code units. Whether that operation will result
in a text that's centered visually on output is a completely
different story. The original string could contain surrogates,
it could also contain combining code points, so the visual
presentation of the result may very well not be centered at
all; it may not even appear as having the length n to the user.

Since we're not going to change the semantics of those APIs,
it is OK to not support padding with non-BMP code points on
UCS-2 builds.

Supporting such cases would only cause problems:

* if the methods padded with surrogates, the resulting
  string would no longer have length n, breaking the
  assumption that len(str.center(n)) == n

* if the methods padded with half the number of surrogates
  to make sure that len(str.center(n)) == n, the resulting
  output to e.g. a terminal would be further off than what
  you already have with surrogates and combining code points
  in the original string.
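
The arithmetic behind the two cases can be sketched as follows, assuming Python 3.3 or later to emulate the counts. `utf16_units` is a hypothetical helper that reports what len() would have returned on a UCS-2 build.

```python
def utf16_units(s):
    """UTF-16 code units in s, i.e. len(s) on a UCS-2 build."""
    return len(s.encode('utf-16-le')) // 2

text, width, fill = 'xyz', 20, '\U00100140'
fill_units = utf16_units(fill)                  # a surrogate pair

# Case 1: pad with `width - len(text)` fill characters.
by_char = text.center(width, fill)
units_by_char = utf16_units(by_char)            # exceeds `width`

# Case 2: keep the result at `width` code units.
budget = width - utf16_units(text)              # code units left to fill
whole_fills, leftover = divmod(budget, fill_units)

print(len(by_char), units_by_char)
print(budget, whole_fills, leftover)
```

In the second case the odd budget leaves one code unit that no whole surrogate pair can occupy.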

--
nosy: +lemburg




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +ezio.melotti




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

Alexander Belopolsky  added the comment:

On Wed, Nov 24, 2010 at 10:33 AM, Antoine Pitrou  wrote:
..
> The question is, what should it do with such an input?

I think the rule for such functions should be that if
input.encode('utf-8') is the same on wide and narrow builds, then the
output.encode('utf-8') should be the same.

> Pretend it's a single char (but other chars in the source string won't get
> the same treatment)?

Yes, *and* surrogate pairs in the source string should count for one
char as well.

> Treat it as a two-char string (but then center() and friends should
> logically be extended to accept strings of arbitrary lengths)?

No.  For better or worse, on wide builds these methods effectively
operate on code points.  They don't interpret multi-code-point
graphemes or take grapheme width into account:


​123


Application code has to ascertain that it is dealing with fixed-width
characters in the target font before using these methods for
text alignment.

--




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Eric Smith

Eric Smith  added the comment:

str.__format__ and friends (int, float, complex) also have this same problem. 
For example, when they're computing the "fill" character:

>>> format('', 'x^')
''

>>> format('', '\U00100140^')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid conversion specification
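
For contrast, a short sketch of the same kind of format specification on Python 3.3 or later (an assumption of this example), where a non-BMP fill character is a single character and is accepted. An explicit width is added so that the fill is visible in the result.

```python
fill = '\U00100140'                       # the fill character used above

centered = format('x', fill + '^5')       # fill, align '^', width 5
padded_int = format(7, fill + '>4')       # same mini-language for int

print(len(centered), len(padded_int))
```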

--
nosy: +eric.smith




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Antoine Pitrou

Antoine Pitrou  added the comment:

The question is, what should it do with such an input? Pretend it's a single 
char (but other chars in the source string won't get the same treatment)? Treat 
it as a two-char string (but then center() and friends should logically be 
extended to accept strings of arbitrary lengths)?

--
nosy: +pitrou




[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

2010-11-24 Thread Alexander Belopolsky

New submission from Alexander Belopolsky :

>>> 'xyz'.center(20, '\U00100140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long

str.ljust and str.rjust are similarly affected.
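
A quick way to check whether a given interpreter is affected, sketched here under the assumption that it runs on Python 3.3 or later, where all three calls succeed:

```python
fill = '\U00100140'              # one code point outside the BMP
assert len(fill) == 1            # holds on Python 3.3+; a narrow build gave 2

centered = 'xyz'.center(20, fill)
left = 'xyz'.ljust(20, fill)
right = 'xyz'.rjust(20, fill)

print(len(centered), len(left), len(right))
```

On an affected narrow build the first assertion would already fail, since the fill string has length 2 there.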

--
components: Interpreter Core
messages: 122280
nosy: belopolsky
priority: normal
severity: normal
stage: needs patch
status: open
title: str methods don't accept non-BMP fillchar on a narrow Unicode build
type: behavior
versions: Python 3.2
