Glyph Lefkowitz writes:
But I don't think that anyone is filling up main memory with
gigantic piles of character indexes and needs to squeeze out that
extra couple of bytes of memory on such a tiny object.
How do you think editors and browsers represent the regions that they
highlight,
Terry Reedy wrote:
On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:
Any non-trivial text processing is likely to be broken in the presence of
surrogates. Producing them on input is just trading a known issue for
an unknown one. Processing surrogate pairs in Python code is hard.
Software that
Alexander Belopolsky wrote:
On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org
wrote:
..
I note that an opinion has been raised on this thread that
if we want compressed internal representation for strings, we should
use UTF-8. I tend to agree, but UTF-8 has been
On Friday 19 November 2010 23:25:03 you wrote:
Python is unclear about non-BMP characters: the narrow build was called
ucs2 for a long time, even though it is UTF-16 (each character is encoded
as one or two UTF-16 words).
No, no, no :-)
UCS2 and UCS4 are more appropriate than narrow and wide or
M.-A. Lemburg writes:
That would be a possibility as well... but I doubt that many users
are going to bother, since slicing surrogates is just as bad as
slicing combining code points and the latter are much more common in
real life and they do happen to mostly live in the BMP.
That's
M.-A. Lemburg writes:
Please note that we can only provide one way of string indexing
in Python using the standard s[1] notation, and since we do
want that operation to be fast, no more than O(1), using the
code units as items is the only reasonable way to implement it.
AFAICT, the
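The indexing trade-off is easy to see even on a wide build: the code-point view and the 16-bit code-unit view of the same text disagree in length as soon as a non-BMP character appears. A small illustrative sketch (the sample string is made up):

```python
s = "a\U00010140b"  # U+10140 lies outside the BMP

# code points, as a wide build stores them: O(1) indexing per code point
assert len(s) == 3
assert s[1] == "\U00010140"

# 16-bit code units, as a narrow build stores them: U+10140 needs a
# surrogate pair, so the same text occupies four units
units = s.encode("utf-16-le")
assert len(units) // 2 == 4
```

Indexing by code unit stays O(1) in either storage; indexing by code point over UTF-16 storage would require a scan.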
On Nov 24, 2010, at 4:03 AM, Stephen J. Turnbull wrote:
You end up proliferating types that all do the same kind of thing. Judicious
use of inheritance helps, but getting the fundamental abstraction right is
hard. Or least, Emacs hasn't found it in 20 years of trying.
Emacs hasn't even
On Nov 24, 2010, at 10:55 PM, Stephen J. Turnbull wrote:
Greg Ewing writes:
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But
if you actually need to remember positions, or regions, to jump to
later or to communicate to other code that manipulates them, doing
this stuff the straightforward
James Y Knight writes:
a) You seem to be hung up implementation details of emacs.
Hung up? No. It's the program whose text model I know best, and even
if its design could theoretically be a lot better for this purpose, I
can't say I've seen a real program whose model is obviously better for
James Y Knight writes:
But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly
superior [...] because it is an ASCII superset, and thus more
easily compatible with other software. That also makes it most
commonly used for internet communication.
Sure, UTF-8 is very nice as a
On Wed, 24 Nov 2010 18:51:49 +0900
Stephen J. Turnbull step...@xemacs.org wrote:
James Y Knight writes:
But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly
superior [...] because it is an ASCII superset, and thus more
easily compatible with other software. That also makes
On Tue, Nov 23, 2010 at 2:18 PM, Amaury Forgeot d'Arc
amaur...@gmail.com wrote:
..
Given the apparent difficulty of writing even basic text processing
algorithms in the presence of surrogate pairs, I wonder how wise it is to
expose Python users to them.
This was already discussed two years ago:
Alexander Belopolsky wrote:
To conclude, I feel that rather than trying to fully support non-BMP
characters as surrogate pairs in narrow builds, we should make it
easier for application developers to avoid them.
I don't understand what you're after here. Programmers can easily
avoid them by
On Wed, Nov 24, 2010 at 1:50 PM, M.-A. Lemburg m...@egenix.com wrote:
..
add an option for decoders that currently produce surrogate pairs to
treat non-BMP characters as errors and handle them according to user's
choice.
But what do you gain by doing this ? You'd lose the round-trip
safety
On 24/11/10 13:22, James Y Knight wrote:
Instead, provide bidirectional iterators which can traverse the string by byte,
codepoint, or by grapheme
Maybe it would be a good idea to add some iterators like this
to Python. (Or has the time machine beaten me there?)
--
Greg
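Greg's iterator idea is easy to sketch. Real grapheme segmentation is specified by UAX #29 and is much more involved, but a naive approximation that simply attaches combining marks to the preceding base character (the function name here is made up for illustration, not a proposed API) looks like:

```python
import unicodedata

def iter_graphemes(s):
    # naive sketch: a combining mark joins the preceding base character;
    # full UAX #29 segmentation handles many more cases (ZWJ, Hangul, ...)
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

# "e" + COMBINING ACUTE ACCENT groups into one cluster
assert list(iter_graphemes("e\u0301a")) == ["e\u0301", "a"]
```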
Alexander Belopolsky writes:
Any non-trivial text processing is likely to be broken in the presence of
surrogates.
If you're worried about this, write a UCS-2-producing codec that
rejects surrogates or stuffs them into the private zone of the BMP.
Maybe such a codec should be default, but so far
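A surrogate-rejecting decode step is straightforward to sketch as a wrapper; a real codec would integrate with the codecs machinery and raise UnicodeDecodeError, so the helper name and plain ValueError here are illustrative only:

```python
def decode_bmp_only(data, encoding="utf-8"):
    # decode normally, then reject anything outside the BMP so that
    # downstream code never sees surrogate pairs on a narrow build
    s = data.decode(encoding)
    for i, ch in enumerate(s):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "non-BMP character %r at position %d" % (ch, i))
    return s

assert decode_bmp_only(b"abc") == "abc"
```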
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But
if you actually need to remember positions, or regions, to jump to
later or to communicate to other code that manipulates them, doing
this stuff the straightforward way (just copying the whole iterator
object to hang on to its state) becomes
On 25/11/10 06:37, Alexander Belopolsky wrote:
I don't think there is a recipe on how to fix legacy
character-by-character processing loop such as
for c in string:
...
to make it iterate over code points consistently in wide and narrow
builds.
A couple of possibilities:
1) Make
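One possible shape for such a recipe (a sketch, not anything from the stdlib): walk the string by index and join a high/low surrogate pair into a single item, so the loop body sees whole code points on both build types:

```python
def iter_code_points(s):
    # on a wide build surrogate pairs never occur, so this degrades to
    # plain iteration; on a narrow build adjacent surrogates are joined
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        if ("\ud800" <= ch <= "\udbff" and i + 1 < n
                and "\udc00" <= s[i + 1] <= "\udfff"):
            yield ch + s[i + 1]
            i += 2
        else:
            yield ch
            i += 1
```

The legacy loop then becomes `for c in iter_code_points(string): ...` with identical behavior on narrow and wide builds.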
Greg Ewing writes:
On 24/11/10 22:03, Stephen J. Turnbull wrote:
But
if you actually need to remember positions, or regions, to jump to
later or to communicate to other code that manipulates them, doing
this stuff the straightforward way (just copying the whole iterator
object to
On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull step...@xemacs.org wrote:
..
I note that an opinion has been raised on this thread that
if we want compressed internal representation for strings, we should
use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as
too
On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:
Any non-trivial text processing is likely to be broken in the presence of
surrogates. Producing them on input is just trading a known issue for
an unknown one. Processing surrogate pairs in Python code is hard.
Software that has to support non-BMP
Terry Reedy writes:
Yes. As I read the standard, UCS-2 is limited to BMP chars.
Et tu, Terry?
OK, I change my vote on the suggestion of UCS2 to -1. If a couple
of conscientious blokes like you and David both understand it that
way, I can't see any way to fight it.
FWIW, ISO/IEC 10646 (which
If you don't care about the ISO standard, but only about Python,
Martin's right, I was wrong. You can stop reading now. *wink*
Martin v. Löwis writes:
I could only find the FCD of 10646:2010, where annex H was integrated
into section 10:
Thank you for the reference.
I referred to two older
Martin v. Löwis writes:
I disagree: Quoting from Unicode 5.0, section 5.4:
# The individual components of implementations may have different
# levels of support for surrogates, as long as those components are
# assembled and communicate correctly.
Assembly is the problem. If chr() or
Nick Coghlan writes:
For practical purposes, UCS2/UCS4 convey far more inherent information
than narrow/wide:
That was my stance, but in fact (1) the ISO JTC1/SC2 has deliberately
made them ambiguous by changing their definitions over the years[1],
and (2) the more recent definitions and
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
raymond.hettin...@gmail.com wrote:
..
Any explanation we give users needs to let them know two things:
* that we cover the entire range of unicode not just BMP
* that sometimes len(chr(i)) is one and sometimes two
This discussion motivated me
2010/11/23 Alexander Belopolsky alexander.belopol...@gmail.com:
This discussion motivated me to start looking into how well Python
library itself is prepared to deal with len(chr(i)) == 2. I was not
surprised to find that textwrap does not handle the issue that well:
len(wrap(' \U00010140' *
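The length discrepancy behind the textwrap problem can be reproduced on any build: a narrow build reports two items per non-BMP character. A small sketch (the helper name and the repeat count are made up for illustration):

```python
def narrow_len(s):
    # the length a narrow (UTF-16) build would report for s:
    # supplementary-plane characters count as two code units
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)

assert narrow_len("A") == 1
assert narrow_len("\U00010140") == 2        # len(chr(0x10140)) on narrow
assert narrow_len(" \U00010140" * 10) == 30  # roughly what wrap() measures
```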
Alexander Belopolsky wrote:
On Mon, Nov 22, 2010 at 1:13 PM, Raymond Hettinger
raymond.hettin...@gmail.com wrote:
..
Any explanation we give users needs to let them know two things:
* that we cover the entire range of unicode not just BMP
* that sometimes len(chr(i)) is one and sometimes two
On 11/23/2010 2:11 PM, Alexander Belopolsky wrote:
This discussion motivated me to start looking into how well Python
library itself is prepared to deal with len(chr(i)) == 2. I was not
Good idea!
surprised to find that textwrap does not handle the issue that well:
len(wrap(' \U00010140'
Alexander Belopolsky wrote:
Because the most commonly used characters are all in the Basic
Multilingual Plane, converting between surrogate pairs and the
original values is often not tested thoroughly. This leads to
persistent bugs, and potential security holes, even in popular and
On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
Maybe Python should have used UTF-8 as its internal unicode
representation. Then people who were foolish enough to assume
one character per string item would have their programs break
rather soon under only light unicode testing. :-)
You put a
On Nov 23, 2010, at 7:22 PM, James Y Knight wrote:
On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
Maybe Python should have used UTF-8 as its internal unicode
representation. Then people who were foolish enough to assume
one character per string item would have their programs break
rather
Alexander Belopolsky writes:
Yet finding a bug in a str object method after a 5 min review was a
bit discouraging:
>>> 'xyz'.center(20, '\U00010140')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: The fill character must be exactly one character long
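The reason center() balks is that on a narrow build chr(0x10140) really is two items: the UTF-16 surrogate pair. The pair can be computed by hand with the standard UTF-16 encoding rule (a sketch; the function name is made up):

```python
def surrogate_pair(cp):
    # UTF-16 encoding of a supplementary-plane code point (cp > 0xFFFF)
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

hi, lo = surrogate_pair(0x10140)
assert (hi, lo) == (0xD800, 0xDD40)
# matches what the UTF-16 codec produces
assert "\U00010140".encode("utf-16-be") == b"\xd8\x00\xdd\x40"
```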
James Y Knight writes:
You put a smiley, but, in all seriousness, I think that's actually
the right thing to do if anyone writes a new programming
language. It is clearly the right thing if you don't have to be
concerned with backwards-compatibility: nobody really needs to be
able to
On Nov 23, 2010, at 9:44 PM, Stephen J. Turnbull wrote:
James Y Knight writes:
You put a smiley, but, in all seriousness, I think that's actually
the right thing to do if anyone writes a new programming
language. It is clearly the right thing if you don't have to be
concerned with
Note that I'm not saying that there shouldn't be a UTF-8 string type;
I'm just saying that for some purposes it might be a good idea to keep
UTF-16 and UTF-32 string types around.
Glyph Lefkowitz writes:
The theory is that accessing the first character of a region in a
string often occurs
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
Or you can give user programs memory indices, and enjoy the fun as
the poor developers do things like pos += 1 which works fine on
the ASCII data they have lying around, then wonder why they get
Unicode errors when they take substrings.
On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
By the way, to send the ball back into your court, I have this feeling
that the demand for UTF-8 is once again driven by native English
speakers who are very shortly going to find themselves, and the data
they are most familiar with,
Unicode 5.0, Chapter 3, verse C9:
When a process generates a code unit sequence which purports to be
in a Unicode character encoding form, it shall not emit ill-formed
code sequences.
A Unicode-conforming Python implementation would error at the
chr() call, or perhaps
Martin v. Löwis writes:
More interestingly (and to the subject) is chr: how did you arrive
at C9 banning Python3's definition of chr? This chr function puts
the code sequence into well-formed UTF-16; that's the whole point of
UTF-16.
No, it doesn't, in the specific case of surrogate code
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct, as
far as I can tell upon rereading ISO 10646-1, especially Annexes H
(retransmitting devices) and Q (UTF-16). Annex Q makes it clear
that UTF-16 was
Am 22.11.2010 11:47, schrieb Stephen J. Turnbull:
Martin v. Löwis writes:
More interestingly (and to the subject) is chr: how did you arrive
at C9 banning Python3's definition of chr? This chr function puts
the code sequence into well-formed UTF-16; that's the whole point of
UTF-16.
Am 22.11.2010 11:48, schrieb Stephen J. Turnbull:
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct, as
far as I can tell upon rereading ISO 10646-1, especially Annexes H
(retransmitting devices)
Martin,
it is really irrelevant whether the standards have decided
to no longer use the terms UCS-2 and UCS-4 in their latest
standard documents.
The definitions still stand (just like Unicode 2.0 is still a valid
standard, even if it's ten years old):
* UCS-2 is defined as Universal Character
Why don't y'all just call them --unichar-width=16/32. That describes
precisely what the options do, and doesn't invite any quibbling over
definitions.
James
Python-Dev mailing list
Python-Dev@python.org
On Mon, Nov 22, 2010 at 10:47 PM, M.-A. Lemburg m...@egenix.com wrote:
Please also note that we have used the terms UCS-2 and UCS-4 in Python2
for 9+ years now and users are just starting to learn the difference
and get acquainted with the fact that Python uses these two forms.
Confronting
On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan ncogh...@gmail.com wrote:
..
*(The first Google hit for ucs2 is the UTF-16/UCS-2 article on
Wikipedia, the first hit for ucs4 is the UTF-32/UCS-4 article)
Do you think these articles are helpful for someone learning how to
use chr() and ord() in
On Tue, Nov 23, 2010 at 2:03 AM, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
On Mon, Nov 22, 2010 at 10:37 AM, Nick Coghlan ncogh...@gmail.com wrote:
..
*(The first Google hit for ucs2 is the UTF-16/UCS-2 article on
Wikipedia, the first hit for ucs4 is the UTF-32/UCS-4 article)
On Mon, Nov 22, 2010 at 11:13 AM, Nick Coghlan ncogh...@gmail.com wrote:
..
Do you think these articles are helpful for someone learning how to
use chr() and ord() in Python for the first time?
No, that's what the documentation of chr() and ord() is for. For that
use case, it doesn't matter
On Mon, 22 Nov 2010 12:00:14 -0500, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
I recently updated chr() and ord() documentation and used
narrow/wide terms. I thought USC2/4 proponents objected to that on
the basis that these terms are imprecise.
For reference, a grep in
On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray rdmur...@bitdance.com wrote:
..
For reference, a grep in py3k/Doc reveals that there are currently exactly
23 lines mentioning UCS2 or UCS4 in the docs.
Did you grep for USC-2 and USC-4 as well? I have to admit that my
aversion to these terms
On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote:
I disagree. I do see a problem with UCS-2, because it fails to tell
us that Python implements a large number of features that make it easy
to do a very good job of working with non-BMP data in 16-bit builds of
Yes. As I read the standard,
On Nov 22, 2010, at 2:48 AM, Stephen J. Turnbull wrote:
Raymond Hettinger writes:
Neither UTF-16 nor UCS-2 is exactly correct anyway.
From a standards lawyer point of view, UCS-2 is exactly correct,
You're twisting yourself into definitional knots.
Any explanation we give users needs to
On Nov 22, 2010, at 9:41 AM, Terry Reedy wrote:
On 11/22/2010 5:48 AM, Stephen J. Turnbull wrote:
I disagree. I do see a problem with UCS-2, because it fails to tell
us that Python implements a large number of features that make it easy
to do a very good job of working with non-BMP data
Raymond Hettinger wrote:
Any explanation we give users needs to let them know two things:
* that we cover the entire range of unicode not just BMP
* that sometimes len(chr(i)) is one and sometimes two
The term UCS-2 is a complete communications failure
in that regard. If someone looks up
On Mon, Nov 22, 2010 at 12:41 PM, Terry Reedy tjre...@udel.edu wrote:
..
What Python does might be called USC-2+ or UCS-2e (xtended).
Wow! I am not the only one who can't get the order of letters right
in these acronyms. (I am usually consistent within one sentence,
though.) :-)
On Mon, 22 Nov 2010 12:37:59 -0500, Alexander Belopolsky
alexander.belopol...@gmail.com wrote:
On Mon, Nov 22, 2010 at 12:30 PM, R. David Murray rdmur...@bitdance.com
wrote:
..
For reference, a grep in py3k/Doc reveals that there are currently exactly
23 lines mentioning UCS2 or UCS4 in
Martin v. Löwis writes:
Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
Martin v. Löwis writes:
The term UCS-2 is a character set that can only encode 65536
characters; it thus refers to Unicode 1.1. According to the Unicode
Consortium's FAQ, the term UCS-2 should
On Sun, 21 Nov 2010 21:55:12 +0900, Stephen J. Turnbull step...@xemacs.org
wrote:
Martin v. Löwis writes:
Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
Martin v. Löwis writes:
The term UCS-2 is a character set that can only encode 65536
characters; it thus
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
I'm sorry, but I have to disagree. As a relative unicode ignoramus,
UCS-2 and UCS-4 convey almost no information to me, and the bits I
have heard about them on this list have only confused me.
From the users point of view, it doesn't
I disagree. Python does conform to UTF-16
I'm sure the codecs do. But the Unicode standard doesn't care about
the parts of the process, it cares about what it does as a whole.
Chapter and verse?
Python's internal coding does not conform to UTF-16, and that internal
coding can, under
On Sun, 21 Nov 2010 10:17:57 -0800, Raymond Hettinger
raymond.hettin...@gmail.com wrote:
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
I'm sorry, but I have to disagree. As a relative unicode ignoramus,
UCS-2 and UCS-4 convey almost no information to me, and the bits I
have heard
On Fri, Nov 19, 2010 at 4:43 PM, Martin v. Löwis mar...@v.loewis.de wrote:
In my opinion, the question is more why it was not fixed in Python 2. I
suppose
that the answer is something ugly like backward compatibility or
historical
reasons :-)
No, there was a deliberate decision to not
Martin v. Löwis writes:
Chapter and verse?
Unicode 5.0, Chapter 3, verse C9:
When a process generates a code unit sequence which purports to be
in a Unicode character encoding form, it shall not emit ill-formed
code sequences.
I think anything called UTF-8 something is likely to
R. David Murray writes:
I'm sorry, but I have to disagree. As a relative unicode ignoramus,
UCS-2 and UCS-4 convey almost no information to me, and the bits I
have heard about them on this list have only confused me.
OK, point taken.
On the other hand, I understand that 'narrow' means
Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
Martin v. Löwis writes:
The term UCS-2 is a character set that can only encode 65536
characters; it thus refers to Unicode 1.1. According to the Unicode
Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do
On Sat, Nov 20, 2010 at 4:05 AM, Martin v. Löwis mar...@v.loewis.de wrote:
..
A technical correct description would be to say that Python uses either
16-bit code units or 32-bit code units; for brevity, these can be called
narrow and wide code units.
+1
PEP 261 introduced terms wide
I was recently surprised to learn that chr(i) can produce a string of
length 2 in Python 3.x. I suspect that I am not alone in finding this
behavior non-obvious, given that a mistake in the Python manual stating the
contrary survived several releases. [1] Note that I am not arguing
that the change was
On Fri, 19 Nov 2010 11:53:58 -0500
Alexander Belopolsky alexander.belopol...@gmail.com wrote:
Since this feature will be first documented in the
Library Reference in 3.2, I wonder if it will be appropriate to
mention it in What's new in 3.2?
No, since it's not new in 3.2. No need to further
Hi,
On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
I was recently surprised to learn that chr(i) can produce a string of
length 2 in python 3.x.
Yes, but only on narrow build. Eg. Debian and Ubuntu compile Python 3.1 in
wide mode (sys.maxunicode == 1114111).
I suspect that
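The build type Victor describes is easy to detect at runtime through sys.maxunicode (the variable name in this sketch is illustrative):

```python
import sys

# a wide build covers the full Unicode range (0x10FFFF == 1114111);
# a narrow build stops at the BMP (sys.maxunicode == 0xFFFF)
build = "wide" if sys.maxunicode == 0x10FFFF else "narrow"
```

Code that must behave identically on both build types can branch on this check rather than on len(chr(i)).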
In my opinion, the question is more why it was not fixed in Python 2. I
suppose
that the answer is something ugly like backward compatibility or
historical
reasons :-)
No, there was a deliberate decision to not support that, see
http://www.python.org/dev/peps/pep-0261/
There had been a
Victor Stinner wrote:
Hi,
On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
I was recently surprised to learn that chr(i) can produce a string of
length 2 in python 3.x.
Yes, but only on narrow build. Eg. Debian and Ubuntu compile Python 3.1 in
wide mode (sys.maxunicode ==
It's rather common to confuse a transfer encoding with a storage format.
UCS2 and UCS4 refer to code units (the storage format).
Actually, they don't. Instead, they refer to coded character sets,
in W3C terminology: mapping of characters to natural numbers. See
Martin v. Löwis writes:
The term UCS-2 is a character set that can only encode 65536
characters; it thus refers to Unicode 1.1. According to the Unicode
Consortium's FAQ, the term UCS-2 should be avoided these days.
So what do you propose we call the Python implementation? You can